## André Greiner-Petter

# Making Presentation Math Computable

A Context-Sensitive Approach for Translating LaTeX to Computer Algebra Systems


André Greiner-Petter Berlin, Germany

ISBN 978-3-658-40472-7 ISBN 978-3-658-40473-4 (eBook) https://doi.org/10.1007/978-3-658-40473-4

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer Vieweg imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH, part of Springer Nature.

The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany





### **Front Matter**

### **List of Figures**



### **List of Tables**



### **Abstract**

This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax of Computer Algebra Systems (CAS). Over the past decades, especially in the domain of Science, Technology, Engineering, and Mathematics (STEM), LaTeX has become the de-facto standard for typesetting mathematical formulae in publications. Since scientists are generally required to publish their work, LaTeX has become an integral part of today's publishing workflow. On the other hand, modern research increasingly relies on CAS to simplify, manipulate, compute, and visualize mathematics. However, existing LaTeX import functions in CAS are limited to simple arithmetic expressions and are, therefore, insufficient for most use cases. Consequently, the workflow of experimenting and publishing in the Sciences often includes time-consuming and error-prone manual conversions between presentational LaTeX and computational CAS formats.

To address the lack of a reliable and comprehensive translation tool between LaTeX and CAS, this thesis makes the following three contributions.

First, it provides an approach to semantically enhance LaTeX expressions with sufficient semantic information for translations into CAS syntaxes. This so-called *semantification* process analyzes the structure of a formula and its textual context to deduce semantic information. The research behind this semantification process additionally contributes towards related Mathematical Information Retrieval (MathIR) tasks, such as mathematical education assistance, math recommendation and question answering systems, search engines, automatic plagiarism detection, and math type assistance systems.

Second, this thesis demonstrates the first context-aware LaTeX to CAS translation framework, LaCASt. LaCASt uses the developed semantification approach to transform LaTeX expressions into an intermediate semantic LaTeX format, which is then further translated to CAS syntaxes based on translation patterns. These patterns were manually crafted by mathematicians to assure accurate and reliable translations. For comparison, this thesis additionally elaborates a non-context-aware neural machine translation approach trained on a mathematical library generated by Mathematica.

Third, the thesis provides a novel approach to evaluate the performance of LaTeX to CAS translations on large-scale datasets via an automatic verification of equations in digital mathematical libraries. This evaluation approach is based on the assumption that equations in digital mathematical libraries can be computationally verified by a CAS if a translation between both systems exists. In addition, the thesis provides an in-depth manual evaluation on mathematical articles from the English Wikipedia.

The presented context-aware translation framework LaCASt increases the efficiency and reliability of translations to CAS. Via LaCASt, we strengthened the Digital Library of Mathematical Functions (DLMF) by identifying numerous issues, from missing or wrong semantic annotations to sign errors. Further, via LaCASt, we were able to discover several issues with the commercial CAS Maple and Mathematica. The fundamental approaches to semantically enhance mathematics developed in this thesis additionally contributed towards several related MathIR tasks. For instance, the large-scale analysis of mathematical notations and the studies on math embeddings motivated new approaches for math plagiarism detection systems and search engines, and enabled typing assistance for mathematical input. Finally, LaCASt translations will have a direct real-world impact, as they are scheduled to be integrated into upcoming versions of the DLMF and Wikipedia.


### **Zusammenfassung**

This dissertation addresses the problem of translating mathematical formulae between LaTeX and Computer Algebra Systems (CAS). Over the course of the digital age, LaTeX has become the de-facto standard for writing mathematical formulae on a computer, especially in the Science, Technology, Engineering, and Mathematics (STEM) disciplines. Since scientists generally publish their work, LaTeX has become an integral part of modern research. Likewise, scientists rely more and more on the capabilities of modern CAS to work effectively with mathematical formulae, for example, by transforming, solving, or visualizing them. Current approaches that allow a translation from LaTeX to CAS, such as the internal import functions of some CAS, are, however, often limited to simple arithmetic expressions and are therefore of little help in real everyday work. As a consequence, the work of modern scientists in the STEM disciplines is often shaped by time-consuming and error-prone manual translations between LaTeX and CAS.

This dissertation makes the following contributions to solve the problem of translating mathematical expressions between LaTeX and CAS.

First, LaTeX is a format that only encodes the visual presentation of mathematical expressions, not their semantic information. This semantic information is, however, necessary for CAS, which do not permit ambiguous inputs. Hence, as the first step towards a translation, this thesis introduces a so-called semantification of mathematical expressions. This semantification extracts semantic information from the context and the components of a formula in order to draw conclusions about its meaning. Since semantification is a classic task in the field of Mathematical Information Retrieval, this part of the dissertation also contributes to related research areas. The approaches presented here are also useful for educational software, question answering systems, search engines, and automatic plagiarism detection.

Second, this dissertation presents the first context-aware LaTeX to CAS translation program, called LaCASt. LaCASt uses the previously introduced semantification to transform LaTeX into an intermediate format that represents the semantic information explicitly. This format is called semantic LaTeX, as it is a technical extension of LaTeX. The subsequent translation to CAS is realized via heuristic translation patterns for mathematical functions. These translation patterns were defined in collaboration with mathematicians to guarantee a correct translation in this final step. To better understand the benefits of a context-aware translation, this thesis also presents, for comparison, a neural machine translation approach that does not take the context of a formula into account.

The third contribution of this dissertation introduces a new method for evaluating mathematical translations, which makes it possible to check the correctness of a large number of translations. This method follows the rationale that equations in mathematical libraries should still be correct after a translation to a CAS. If this is not the case, either the source equation, the translation, or the CAS is faulty. Note that each of these error sources provides added value for the respective system. In addition to this automatic evaluation, a manual analysis of translations is performed on the basis of English Wikipedia articles.

In summary, the context-aware translation program LaCASt enables a more efficient workflow with CAS. With the help of these translations, several problems, such as incorrect information or sign errors, in the Digital Library of Mathematical Functions (DLMF), as well as errors in the commercially distributed CAS Maple and Mathematica, could be automatically uncovered and resolved.

The foundational research on the semantic enrichment of mathematical expressions presented here has additionally contributed to related research topics. For example, the analysis of the distribution of mathematical notations in large datasets has enabled new approaches in automatic plagiarism detection. Furthermore, work is currently underway to integrate the translations of LaCASt into upcoming versions of Wikipedia and the DLMF.


### **Acknowledgements**

This thesis would not have been possible without the tremendous help and support from numerous family members, friends, colleagues, supervisors, and several international institutions. In the following, I want to take the opportunity to thank all the individuals and organizations that helped me along the way to make this work possible.

My first sincere wishes go to my prodigious doctoral advisers Bela Gipp and Akiko Aizawa. Their continuous support and counsel enabled me to realize this thesis at marvelous places and together with numerous wonderful people from all over the world. Their enduring encouragement and assistance, Bela's abiding and infectious positivity, and Akiko's steadfast and kind endorsement empowered my personal and professional life. Both of their competent and sincere guidance helped me to find my way in the intricate maze of research and career decisions and turned my often onerous time into a joyful and memorable experience.

Moreover, I am very grateful to my adviser and friend Moritz Schubotz, who supported and guided me throughout the entire time of my doctoral thesis and even beyond. Our fruitful and always engaging discussions, even when exhausting, enriched and positively affected most, if not all, of my work. It is not an exaggeration to admit that my career, including my Master's thesis and this doctoral thesis, would not have been possible and nearly as successful and joyful as it has been without his continuous and sincere support over the years. I am wholeheartedly thankful for all the years we worked together.

I further wish to gratefully acknowledge my friends, colleagues, and advisers Howard Cohl, Abdou Youssef, and Bruce Miller at the National Institute of Standards and Technology (NIST) for their valuable advice, continuous drive to perfection, and our rewarding collaborations. I thank Jürgen Gerhard at Maplesoft, who kindly provided me access and support for Maple on several occasions. I am just as thankful for the assistance and support from Norman Meuschke, who always helped me to overcome governmental and organizational hurdles, Corinna Breitinger, who never failed to refit my gibberish, and my colleagues and friends Terry Lima Ruas and Philipp Scharpf for many visionary discussions. I also thank all my collaborators and colleagues with whom I had the distinct opportunity to work together, including Takuto Asakura, Fabian Müller, Olaf Teschke, William Grosky, Marjorie McClain, Yusuke Miyao, Malte Ostendorf, Bonita Saunders, Kenichi Iwatsuki, Takuma Udagawa, Anastasia Zhukova, and Felix Hamborg. I further want to thank the students I worked with, including Avi Trost, Rajen Dey, Joon Bang, Kevin Chen, and Felix Petersen. I especially appreciate the help and assistance from people at the National Institute of Informatics (NII) to overcome governmental and daily life issues. I wish to especially thank Rie Ayuzawa, Noriko Katsu, Akiko Takenaka, and Goran Topic.

My genuine gratitude also goes to my host organizations and those that provided financial support for my research. I am thankful to the German Academic Exchange Service (DAAD) for enabling two research stays at the NII in Tokyo, the NII for providing me a wonderful work environment, the German Research Foundation (DFG) for financially supporting many of my projects, the NIST for hosting me as a guest researcher, and Maplesoft for offering me an internship during my preliminary research project on the Digital Library of Mathematical Functions (DLMF). I finally thank the ACM Special Interest Group on Information Retrieval (SIGIR), the University of Konstanz, the University of Wuppertal, and Maplesoft for supporting several conference participations.

My last and most crucial gratitude goes to my family and friends, who always cheered me in good and bad times and constantly backed and supported me so that I could selfishly pursue my dreams. I am deeply grateful for my lovely parents Rolf & Regina, who have always been on my side and made all this possible behind the scenes. I am also tremendously thankful for the enduring personal support from my dear friends Kevin, Lena, Vici, Dong, Peter, Vitor, Ayuko, and uncountably more. Finally, I thank my lovely partner Aimi for brightening even the darkest times and pushing every possible obstacle aside. I dedicate this thesis to my lovely parents, my dear friends, and my enchanting girlfriend.

*I went to the woods because I wanted to live deliberately. I wanted to live deep and suck out all the marrow of life. To put to rout all that was not life; and not, when I had come to die, discover that I had not lived.*

Neil Perry - *Dead Poets Society*

### **CHAPTER 1**

### **Introduction**


This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax of Computer Algebra Systems (CAS), which is typically a time-consuming and error-prone task in the modern life of many researchers. A reliable and comprehensive translation approach requires analyzing the textual context of mathematical formulae. In turn, research advances in translating LaTeX contribute directly towards related tasks in the Mathematical Information Retrieval (MathIR) arena. In this chapter, I provide an introduction to the topic. Section 1.1 introduces my motivation and provides an overview of the problem. Section 1.2 summarizes the research gap. In Section 1.3, I define the research objective and research tasks of this thesis. Section 1.4 concludes with an outline of the thesis, including an overview of the publications that contributed to the goals of this thesis and the research path that led to these publications.

### **1.1 Motivation & Problem**

Consider a researcher working on Jacobi polynomials who examines the existing English Wikipedia article on the topic<sup>1</sup>. While she might be familiar with the Digital Library of Mathematical Functions (DLMF) [98], a standard resource for Orthogonal Polynomials and Special Functions (OPSF), equation (1.1) from the article might be new to her

$$P_n^{(\alpha,\beta)}(x) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)} \sum_{m=0}^n \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)} \left(\frac{x-1}{2}\right)^m. \tag{1.1}$$

In order to analyze this new equation, e.g., to validate it, she wants to use a CAS. CAS are powerful mathematical software tools with numerous applications [207]. Today's most widely


<sup>1</sup> https://en.wikipedia.org/wiki/Jacobi\_polynomials [accessed 2021-10-01]. Hereafter, dates follow the ISO 8601 standard, i.e., YYYY-MM-DD.

**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-40473-4\_1.


Table 1.1: Different representations of a Jacobi polynomial.

used CAS include Maple [36], Mathematica [393], and MATLAB [246]. Scientists use CAS<sup>2</sup> to simplify, manipulate, evaluate, compute, or even visualize mathematical expressions. Thus, CAS play a crucial role in the modern era of pure and applied mathematics [8, 184, 207, 262] and have even found their way into classrooms [237, 363, 365, 389, 390]. In turn, CAS are the perfect tool for the researcher in our example to examine the formula further. In order to use a CAS, she needs to translate the expression into the correct CAS syntax.

Table 1.1 illustrates the differences between computable and presentational encodings of a Jacobi polynomial. While the rendered version and the LaTeX [220] encoding only provide visual information, semantic LaTeX [403] and the CAS encodings explicitly encode the meaning, i.e., the semantics, of the formula. On the one hand, LaTeX<sup>3</sup> has become the de-facto standard for typesetting mathematics in scientific publications [129, 248, 402], especially in the domain of Science, Technology, Engineering, and Mathematics (STEM). On the other hand, computational advances make CAS an essential asset in the modern workflow of experimenting and publishing in the Sciences. Translating expressions between LaTeX and CAS syntaxes is, therefore, a typical task in the everyday life of our hypothetical researcher. Despite this common need, no reliable translation from a presentational format, such as LaTeX, to a computable format, such as Mathematica, is available to date. The only option our hypothetical researcher has is to manually translate the expression into the specific syntax of a CAS. This process is time-consuming and often error-prone.

 **Problem:** No reliable translation from a presentational mathematical format to a computable mathematical format exists to date.

If a translation between LaTeX and CAS is so essential, why are no translation tools available? As is often the case in research, the reasons are diversified. First, some translation approaches are available. Some CAS, such as Mathematica and SymPy, allow importing LaTeX expressions. Most CAS support at least the Mathematical Markup Language (MathML), since it is the current web standard for encoding mathematical formulae. With numerous tools available to transfer LaTeX to MathML [18], a translation from LaTeX to CAS syntaxes should not be a difficult task. However, none of these available translation techniques are reliable


<sup>2</sup> In the sequel, the acronym CAS is used interchangeably with its plural.

<sup>3</sup> https://www.latex-project.org/ [accessed 2021-10-01]



and comprehensive. Table 1.2 illustrates how Mathematica, one of the major proprietary CAS, fails to import even simple formulae. Another option is SnuggleTeX [251], a LaTeX to MathML converter that also supports translations to Maxima [324]. SnuggleTeX fails to translate all expressions in Table 1.2. Alternative translations via MathML as an intermediate format perform similarly (as we will show later in Section 2.3).
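The gap between simple arithmetic and real mathematical notation is easy to reproduce. The following toy mapper is purely illustrative (its rewrite rules and Mathematica-like target syntax are my assumptions, not the import logic of any actual CAS): it handles flat arithmetic but passes structured notation such as the Jacobi polynomial through untouched.

```python
import re

# Illustrative rewrite rules from a tiny LaTeX fragment to a
# Mathematica-like syntax (assumed for demonstration purposes only).
RULES = [
    (re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}"), r"(\1)/(\2)"),  # \frac{a}{b} -> (a)/(b)
    (re.compile(r"\\sqrt\{([^{}]*)\}"), r"Sqrt[\1]"),               # \sqrt{a}    -> Sqrt[a]
    (re.compile(r"\\pi\b"), "Pi"),                                  # \pi         -> Pi
]

def translate(latex):
    """Apply each rewrite rule in order; anything unmatched stays as-is."""
    for pattern, replacement in RULES:
        latex = pattern.sub(replacement, latex)
    return latex

print(translate(r"\frac{1}{2} + \sqrt{x}"))   # (1)/(2) + Sqrt[x]
print(translate(r"P_n^{(\alpha,\beta)}(x)"))  # unchanged: no rule knows this notation
```

Expressions like equation (1.1) fail for the same reason: without semantic knowledge of the notation, a purely syntactic mapper can only guess.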

While the simple cases shown in Table 1.2 could be solved with a more comprehensive and flexible parser and mapping strategy, such a solution would ignore the real challenge of translating mathematics to CAS: the ambiguity. The interpretation of the majority of mathematical expressions is *context-dependent*, i.e., the same formula may refer to different concepts in different contexts. Take the expression *π*(*x* + *y*) as an example. In number theory, the expression most likely refers to the number of primes less than or equal to *x* + *y*. In another context, however, it may just refer to a multiplication *πx* + *πy*. Without considering the context, an appropriate translation of this ambiguous expression is infeasible. Today's translation solutions, however, do not consider the context of an input. Instead, they translate the expression based on internal decisions, which are often not transparent to a user.

Table 1.3 shows the results of importing *π*(*x* + *y*) into different CAS. Each CAS in Table 1.3 interprets *π* as a function call but does not associate it with the prime counting function (nor any other predefined function). Only SnuggleTeX translated *π* as the mathematical constant to Maxima syntax. However, Maxima does not contain a prime counting function. The CAS import functions consider the expression as a generic function with the name *π*. Mathematica, surprisingly, still links *π* with the mathematical constant, which results in a peculiar behavior for numeric evaluations. The expression N[Pi[x+y]] (numeric evaluation of the imported expression) is evaluated to 3*.*14159[*x* + *y*]. Associating the variables *x* and *y* with numbers, say *x, y* = 1, would result in the rather odd expression 3*.*14159[2].
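The two competing readings of *π*(*x* + *y*) can be made explicit in a general-purpose system. A minimal sketch in SymPy, which provides both the constant pi and the prime counting function primepi (the variable names are illustrative):

```python
from sympy import expand, pi, primepi, symbols

x, y = symbols("x y")

# Reading 1: pi is the mathematical constant, so pi(x + y) is a product.
product = expand(pi * (x + y))
print(product)  # pi*x + pi*y

# Reading 2: pi is the prime counting function; at x + y = 10 it counts
# the primes 2, 3, 5, 7.
print(primepi(10))  # 4

# The two readings disagree numerically, e.g., at x + y = 10:
assert float(pi * 10) != float(primepi(10))
```

Without context, a translator has no principled way to decide between the two readings.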

Table 1.3: The results of importing *π*(*x* + *y*) into different CAS. For Maple, a MathML representation was used. Content MathML was not tested, since there is no content dictionary available that defines the prime counting function. SnuggleTeX translated the expression to the CAS Maxima. The two rightmost columns show the expected expressions in the context of the prime counting function or a multiplication. None of the CAS chose either of the two expected interpretations. Note that the prime counting function in Maple can also be written as pi(x+y) and requires preloading the extra package NumberTheory. Nonetheless, this function pi(x+y) is still different from the actually imported expression Pi(x+y). Note further that Maxima does not define a prime counting function.


Why do existing translation techniques not allow users to specify a context? Mainly because it is an open research question what this context is or needs to be. Which exact information is needed to perform translations to CAS syntaxes, and where to find it, is unclear [11]. Some required information is indeed encoded in the structure of the expression itself. Consider a simple fraction 1/2. This expression is context-independent and can be directly translated. The expression *P*<sup>(*α*,*β*)</sup><sub>*n*</sub>(*x*) in the context of OPSF is also often unambiguous for general-purpose CAS. Since Mathematica supports no other formula with this presentational structure, i.e., *P* followed by a subscript and a superscript with parentheses, Mathematica is able to correctly associate *P*<sup>(•,•)</sup><sub>•</sub>(•), where • are wildcards, with the function JacobiP. In other cases, the immediate textual context of the formula provides sufficient information to disambiguate the expression [54, 329]. Suppose an author explicitly declares *π*(*x*) as the prime counting function right before she uses it in *π*(*x*+*y*). In this case, it might be sufficient to scan the surrounding context for key phrases [183, 214, 329], like 'prime counting function', in order to map *π* to, for instance, PrimePi in Mathematica.
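A minimal sketch of such a key-phrase scan is shown below. The phrase table and its entries are illustrative assumptions (only Mathematica's PrimePi and Maple's NumberTheory package are mentioned in this chapter); a production system would need far more sophisticated matching:

```python
# Illustrative key-phrase table linking textual descriptions to CAS
# function names (the entries are assumptions for demonstration, not a
# comprehensive database).
KEY_PHRASES = {
    "prime counting function": {"Mathematica": "PrimePi", "Maple": "NumberTheory:-pi"},
    "jacobi polynomial": {"Mathematica": "JacobiP", "Maple": "JacobiP"},
}

def disambiguate(context, cas):
    """Return a CAS function name if a known key phrase occurs in the
    surrounding text, otherwise None (the symbol stays ambiguous)."""
    text = context.lower()
    for phrase, targets in KEY_PHRASES.items():
        if phrase in text:
            return targets.get(cas)
    return None

ctx = "Let pi(x) denote the prime counting function. Consider pi(x + y) ..."
print(disambiguate(ctx, "Mathematica"))               # PrimePi
print(disambiguate("no declaration nearby", "Maple"))  # None
```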

Often, the semantic explanations of mathematical objects in an article are scattered around the context or absent entirely [394]. An interested reader needs to retrieve sufficient semantic explanations and correctly link them with mathematical objects in order to comprehend the meaning of a complex formula. Sometimes, an author presumes that the interpretation of an expression can be considered common knowledge and, therefore, does not require further explanation. Suppose *π*(*x* + *y*) refers to a multiplication between *π* and (*x* + *y*). In general, an author may consider *π* (the mathematical constant) common knowledge and not explicitly declare its meaning. The same could be true for scientific articles, whose length is often limited. An article about prime numbers probably does not explicitly declare the meaning of *π*(*x* + *y*) because the author presumes the semantics are unambiguous given the overall context of the article.

In other cases, the information needs go beyond a simple text analysis. Consider *π*(*x* + *y*) as a generic function that was previously defined in the article and simply has no name. An appropriate translation would require retrieving the definition of the function from the context. But even if a function is well-known and supported by a CAS, a direct translation might be inappropriate because the definition in the CAS is not what our researcher expected [3, 13]. Legendre's incomplete elliptic integral of the first kind *F*(*φ, k*), for example, is defined with the amplitude *φ* as its first argument in the DLMF and Mathematica. In Maple, however, one needs to use the sine of the amplitude sin(*φ*) as the first argument<sup>4</sup>. In turn, an appropriate translation to Maple might be EllipticF(sin(phi), k) rather than EllipticF(phi, k), depending on the source of the original expression. The English Wikipedia article about elliptic integrals<sup>5</sup> contains both versions and refers to them as *F*(*φ, k*) and *F*(*x*; *k*), respectively. Even though both versions in Wikipedia refer to the same function, correct translations to Maple of *F*(*φ, k*) and *F*(*x*; *k*) are not the same.
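The convention mismatch can be verified numerically. The sketch below uses mpmath (whose ellipf takes the amplitude *φ* and the parameter *m* = *k*²) for the DLMF-style function and re-implements the Maple-style definition as an explicit integral; the function names are my own:

```python
import mpmath

def F_dlmf(phi, k):
    # DLMF/Mathematica convention: first argument is the amplitude phi.
    # mpmath's ellipf takes the parameter m = k^2 instead of the modulus k.
    return mpmath.ellipf(phi, k**2)

def elliptic_f_maple_style(z, k):
    # Maple-style convention: first argument is z = sin(phi), i.e.,
    # EllipticF(z, k) = int_0^z dt / (sqrt(1 - t^2) * sqrt(1 - k^2 t^2)).
    f = lambda t: 1 / (mpmath.sqrt(1 - t**2) * mpmath.sqrt(1 - k**2 * t**2))
    return mpmath.quad(f, [0, z])

phi, k = mpmath.mpf("0.7"), mpmath.mpf("0.5")

# Passing phi directly in both conventions gives different values ...
print(F_dlmf(phi, k), elliptic_f_maple_style(phi, k))
# ... but translating the argument (phi -> sin(phi)) makes them agree:
assert abs(F_dlmf(phi, k) - elliptic_f_maple_style(mpmath.sin(phi), k)) < 1e-12
```

A correct translation must therefore rewrite the argument, not merely map the function name.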

In the case of multi-valued functions, translations between different systems can become eminently more complex [83, 91, 172]. Even for simple cases, such as the arccotangent function arccot(*x*), the behavior of different CAS might be confusing. For example, since arccot(*x*) is multi-valued, there are multiple solutions of arccot(−1). CAS, like any calculator, only compute values on the principal branches and, therefore, return only a single value. The principal branches, however, are not necessarily uniformly positioned among multiple systems [84, 172]. In turn, the returned value of a multi-valued function may depend on the system, see Table 1.4. A translation of arccot(*x*) from the DLMF to arccot(x) in Maple would be only

Table 1.4: Different computation results for arccot(−1) (inspired by [84]).


correct for *x >* 0. Finally, CAS may also compute irrational-looking expressions without objection, e.g., arccot(1/0) returns 1*.*5708 in MATLAB<sup>6</sup>. Even for field experts, it can be challenging to keep track of every property and characteristic of a CAS [20, 100].
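The discrepancy behind Table 1.4 can be reproduced with two common arccotangent conventions. Which CAS uses which convention is noted in the comments as an assumption based on the reported values of arccot(−1); the sketch itself only shows that both conventions agree for *x* > 0 and differ by *π* for *x* < 0:

```python
import math

def arccot_odd(x):
    # Convention A: arccot(x) = atan(1/x); odd function, discontinuous at 0
    # (reportedly the behavior of, e.g., Mathematica and MATLAB).
    return math.atan(1 / x)

def arccot_continuous(x):
    # Convention B: arccot(x) = pi/2 - atan(x); continuous, range (0, pi)
    # (reportedly the behavior of, e.g., Maple).
    return math.pi / 2 - math.atan(x)

assert math.isclose(arccot_odd(1), arccot_continuous(1))     # pi/4 in both
assert math.isclose(arccot_odd(-1), -math.pi / 4)            # -pi/4
assert math.isclose(arccot_continuous(-1), 3 * math.pi / 4)  # 3*pi/4
# For negative arguments, the two conventions differ by exactly pi:
assert math.isclose(arccot_continuous(-1) - arccot_odd(-1), math.pi)
```

A translation layer must therefore either rewrite the expression or restrict the valid domain (here *x* > 0) to stay correct.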

 **Problem:** Existing LaTeX to CAS converters are context-agnostic, inflexible, limited to simple expressions, and nontransparent.

In combination, all of the issues underline that an accurate manual translation to the syntax of CAS is challenging, time-consuming, error-prone, and requires deep and substantial knowledge about the target system. Especially with the increasing complexity of the translated expressions, errors during the translation process might be inevitable. Real-world scenarios often include

<sup>4</sup> https://www.maplesoft.com/support/help/maple/view.aspx?path=EllipticF [accessed 2021-10-01]

<sup>5</sup> https://en.wikipedia.org/wiki/Elliptic\_integral [accessed 2021-10-01]

<sup>6</sup> MATLAB evaluates 1/0 to infinity, and the limit of the arccotangent function in positive infinity is *π*/2 (or roughly 1*.*5708). Yet, the interpretation of the division by zero is not wrong, since it follows the official IEEE 754 standard for floating-point arithmetic [170].

much more complicated formulae compared to the expressions in Table 1.2 or even equation (1.1). Moreover, if an error occurs, the cause of the error can be very challenging to detect and trace back to its origin. The issue of translating arccot(*x*) to Maple, for example, may remain undiscovered until a user calculates negative values. If the function is embedded in a more complex equation, even experts can lose track of potential issues. In combination with unreliable translation tools, working with CAS may even be frustrating. Mathematica, for example, is able to import our test expression (1.1) mentioned earlier without throwing an error<sup>7</sup>. However, investigating the imported expression reveals an incorrect translation due to an issue with factorials. To productively work with CAS, our hypothetical researcher from above needs to carefully evaluate whether the automatically imported expression is correct. As a consequence, existing translation approaches are not practically useful.
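The careful check our researcher would have to perform can itself be automated, which foreshadows the evaluation approach of this thesis: translate the equation into a CAS and let the CAS verify it. A sketch in SymPy, comparing its built-in jacobi against a direct implementation of the right-hand side of equation (1.1) for sample values (the helper name is mine):

```python
from sympy import Rational, binomial, factorial, gamma, jacobi, simplify

def jacobi_rhs(n, a, b, x):
    """Right-hand side of Eq. (1.1): the explicit Gamma-function sum."""
    prefactor = gamma(a + n + 1) / (factorial(n) * gamma(a + b + n + 1))
    total = sum(binomial(n, m) * gamma(a + b + n + m + 1) / gamma(a + m + 1)
                * Rational(1, 2)**m * (x - 1)**m
                for m in range(n + 1))
    return prefactor * total

# If the equation holds, the difference must simplify to zero.
n, a, b, x = 2, 1, 2, Rational(3, 10)
assert simplify(jacobi(n, a, b, x) - jacobi_rhs(n, a, b, x)) == 0
print("Eq. (1.1) verified for n=2, alpha=1, beta=2, x=3/10")
```

A failing check would indicate an error in the source equation, in the translation, or in the CAS itself.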

In this thesis, I will focus on discovering the information needed to perform correct translations from presentational formats, here mainly LaTeX, to computational formats, here mainly CAS syntaxes. My personal motivation is to improve the workflow of researchers by providing them with a reliable translation tool that offers crucial additional information about the translation process. Further, I limit the support of such a translation tool to general-purpose CAS, since many general mathematical expressions simply cannot be translated into appropriate expressions for task-specific CAS (or other mathematical software, such as theorem provers). The focus on general-purpose CAS allows me to provide a broad solution to a general audience. Note further that, in this thesis, I mostly focus on the two major CAS Maple and Mathematica. However, the goal is to provide a translation tool that is easy to extend with support for more CAS.

Further, the real-world applications of such a translation tool go far beyond an improved workflow with CAS. A computable formula can be automatically verified with CAS [51, 52, 2, 8, 13, 153, 184, 414, 415], translated to other semantically enhanced formats, such as OpenMath [53, 57, 119, 152, 303, 361], content MathML [59, 60, 159, 270, 318, 342], or other CAS syntaxes [110, 361], imported into theorem provers [35, 57, 152, 163, 338, 375], or embedded in interactive documents [85, 131, 150, 162, 201, 284]. Since an appropriate translation is generally context-dependent, a translator must use MathIR [141] techniques to access sufficient semantic information. Hence, advances in translating LaTeX to CAS syntaxes also contribute directly towards related MathIR tasks, including entity linking [150, 208, 212, 316, 319, 321, 322], math search engines [92, 181, 182, 203, 211, 236, 274], semantic tagging of math formulae [71, 402], recommendation systems [30, 31, 50, 319], type assistance systems [103, 106, 14, 321, 400], and even plagiarism detection platforms [253, 254, 334].

### **1.2 Research Gap**

Existing translation approaches from presentational formats to computable formats share the same issues. Currently, these translation approaches are


<sup>7</sup> If the binomial is given with the \binom macro rather than \choose.


Issue 4 arises from the fact that there are semantically enhanced data formats that have been specifically developed to make expressions interchangeable between CAS, such as OpenMath [119, 303, 361] and content MathML [318, 343]. Nonetheless, most CAS do not support OpenMath natively [303], and their support for content MathML is limited to school mathematics [318]. The reason is that such a translation requires a database that maps functions between different semantic sources. As discussed above, creating such a comprehensive database can be time-consuming due to slight differences between the systems (e.g., positions of branch cuts, different supported domains, etc.) [361]. Hence, crafting and maintaining such a library is economically unreasonable. Translations between semantically enhanced formats, e.g., between CAS syntaxes, OpenMath, or content MathML, are consequently often unreliable.

In previous research, I focused on Issues 2–4 by developing a rule-based LaTeX to CAS translator, called LaCASt. Originally, LaCASt performed translations from semantic LaTeX to Maple. Relying on semantic LaTeX allows LaCASt to largely ignore the ambiguity Issue 1 and focus on the other problems. For this thesis, I continued to develop LaCASt to further mitigate the *limitation* and *inflexibility* Issues 3 and 4. Further, I focused on extending LaCASt to become the first context-aware translator to tackle the *context-independency* Issue 1.
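At its core, a rule-based translator of this kind is driven by a mapping table from semantic macros to CAS-specific patterns. The following minimal Python sketch illustrates the principle only; the rule format, macro names, and the `translate` helper are hypothetical simplifications, not LaCASt's actual implementation.

```python
# Illustrative sketch of a rule-based macro-to-CAS mapping (hypothetical
# rule format; NOT LaCASt's actual implementation).
import re

# Hypothetical mapping table: semantic macro name -> per-CAS templates,
# where $1, $2, ... mark the argument slots.
RULES = {
    "JacobiP":    {"Maple": "JacobiP($1, $2, $3, $4)",
                   "Mathematica": "JacobiP[$1, $2, $3, $4]"},
    "EulerGamma": {"Maple": "GAMMA($1)",
                   "Mathematica": "Gamma[$1]"},
}

def translate(macro, args, cas):
    """Fill the CAS-specific template for `macro` with the parsed arguments."""
    template = RULES[macro][cas]
    return re.sub(r"\$(\d+)", lambda m: args[int(m.group(1)) - 1], template)

print(translate("JacobiP", ["n", "alpha", "beta", "x"], "Maple"))
# -> JacobiP(n, alpha, beta, x)
print(translate("EulerGamma", ["z"], "Mathematica"))
# -> Gamma[z]
```

The sketch also hints at why maintaining such tables is laborious: every supported function needs a hand-crafted entry per CAS, including argument order and domain caveats.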

### **1.3 Research Objective**

This doctoral thesis aims to:

### **Research Objective**

Develop and evaluate an automated context-sensitive process that makes presentational mathematical expressions computable via computer algebra systems.

Hereafter, I consider the semantic information of a mathematical expression as sufficient if a translation of the expression into the syntax of a CAS becomes feasible. To achieve the research objective, I define the following five research tasks:

### **Research Tasks**


### **1.4 Thesis Outline**

**Chapter 1** provides an introduction to translating presentational mathematical expressions into computable formats. The chapter further defines the research gap for such translations and states the research objective and tasks this thesis addresses. Finally, it outlines the structure of the thesis and briefly summarizes the main publications.

**Chapter 2** provides an overview of related work by examining existing mathematical formats and translation approaches between them. This chapter focuses on **Research Task I** by analyzing the strengths and weaknesses of existing translation approaches, with the main focus on the standard formats LaTeX and MathML.

**Chapter 3** addresses **Research Task II** by studying the capability of math embeddings, introducing a new concept to describe the nested structure of mathematical objects, and presenting a novel context-sensitive semantification process for LaTeX expressions.

**Chapter 4** presents the first context-sensitive LaTeX to CAS translator: LaCASt. In particular, this chapter focuses on **Research Tasks III** and **IV** by implementing the previously introduced semantification process and integrating it into the rule-based semantic LaTeX to CAS translator LaCASt. In addition, the chapter briefly summarizes a context-independent neural machine translation approach to estimate how much structural information is encoded in mathematical expressions.

**Chapter 5** evaluates the new translation tool LaCASt and, therefore, contributes mainly towards **Research Task V**. In particular, it introduces the novel evaluation concept of equation verification to estimate the appropriateness of translated CAS expressions. Our new evaluation concept not only detects issues in the translation pipeline but is also able to identify errors in the source equations, e.g., from the DLMF or Wikipedia, and the target CAS, e.g., Maple or Mathematica. In order to maximize the number of verifiable DLMF equations via our novel evaluation technique, this chapter also introduces some heuristic extensions to the LaCASt pipeline. Hence, this chapter partially contributes to **Research Task IV** too.

**Chapter 6** concludes the thesis by summarizing contributions and their impact on the MathIR community. It further provides a brief overview of the remaining issues and future work.

**An Appendix** is available in the electronic supplementary material and provides additional information about certain aspects of this thesis, including an extended error analysis, result tables, and a summary of bugs and issues we discovered with the help of LaCASt in the DLMF, Maple, Mathematica, and Wikipedia.

### **1.4.1 Publications**

Most parts of this thesis were published in international peer-reviewed conferences and journals. Table 1.5 provides an overview of the publications that are reused in this thesis. The first column identifies the chapter a publication contributed to. The venue rating was taken from the Core ranking<sup>8</sup> for conferences and the Scimago Journal Rank (SJR)<sup>9</sup> for journal articles. Each rank

<sup>8</sup> http://portal.core.edu.au/conf-ranks/ with the ranks: A\* – flagship conference (top 5%), A – excellent conference (top 15%), B – good conference (top 27%), and C – remaining conferences [accessed 2021-10-01].

<sup>9</sup> https://www.scimagojr.com/ with the ranks Q1 – Q4, where Q1 refers to the best 25% of journals in the field, Q2 to the second-best quarter, and so on [accessed 2021-10-01].

was retrieved for the year of publication (or year of submission, in case the paper had not been published yet). Table 1.6 similarly shows publications that partially contributed towards the goal of this thesis but are not reused within a chapter. Note that the publication [3] (in Table 1.6) was part of my Master's thesis and contributed towards this doctoral thesis as a preliminary project. The journal publication [13] (also in Table 1.6) is an extended and, with new results, updated version of that thesis and the mentioned article [3]. The venue abbreviations in both tables are explained in the glossary. Lastly, note that the TPAMI journal article [11] is reused in Chapter 4 (for the methodology) and in Chapter 5 (for the evaluation) to provide a coherent structure. My publications, talks, and submissions are separated from the general bibliography in the back matter and can be found on page 171.


Table 1.5: Overview of the primary publications in this thesis.

Table 1.6: Overview of secondary publications that partially contributed to this thesis.


### **1.4.2 Research Path**

This section provides a brief overview of my research path that led to this thesis, i.e., it discusses the primary publications and the motivations behind them. Every publication is marked with the associated chapter and a reference. This research path is logically (not chronologically) divided into three sections: preliminary work, the semantification of LaTeX, and the evaluation of translations.

**Preliminary Work** I had the first contact with the problem of translating LaTeX to CAS syntaxes during my undergraduate studies in mathematics. During that time, I regularly used

<sup>10</sup>The methodology part of this journal is reused in Chapter 4 while the evaluation part is reused in Chapter 5.

CAS like MATLAB and SymPy for numeric simulations and for plotting results. At the same time, we were required to hand in our homework as LaTeX files. While exporting content from the CAS to LaTeX files was rather straightforward, the other way around, i.e., importing LaTeX into the CAS, required manual conversions. I decided to explore the reasons for this shortcoming in my Master's thesis. During that time, I developed the first version of a semantic LaTeX to CAS translator, which was later coined LaCASt<sup>11</sup>. The results from this first study were published at the Conference on Intelligent Computer Mathematics (CICM) in 2017.

*"Semantic Preserving Bijective Mappings of Mathematical Formulae Between Document Preparation Systems and Computer Algebra Systems"* by Howard S. Cohl, Moritz Schubotz, Abdou Youssef, **André Greiner-Petter**, Jürgen Gerhard, Bonita Saunders, Marjorie McClain, Joon Bang, and Kevin Chen. **In:** *Proceedings of the International Conference on Intelligent Computer Mathematics* (CICM), 2017. Not Reused — [3]

This first version of LaCASt focused specifically on the CAS Maple but was designed modularly to allow later extensions to other CAS. The main limitation of LaCASt, however, was the requirement of using semantic LaTeX macros to disambiguate mathematical expressions manually. An automatic disambiguation process did not exist at the time. Moreover, only a few previous projects had focused on a semantification for translating mathematical formats. Hence, I continued my research in this direction.

In the subsequent parts of this thesis, I will use 'we' rather than 'I', since none of the presented contributions would have been possible without the tremendous and fruitful discussions and help from advisors, colleagues, students, and friends.

**Semantification of LaTeX** As an alternative to semantic LaTeX, we first closely investigated existing converters for MathML (see Section 2.2.1). Since MathML was (and still is) the standard encoding for mathematical expressions on the web, most CAS support MathML. MathML uses two markups, presentation and content MathML. The former visualizes a formula, while the latter describes its semantic content. Hence, content MathML can disambiguate math much like semantic LaTeX. Since MathML is the official web standard and LaTeX the de-facto standard for writing math, there are numerous converters available that translate LaTeX to MathML. As our first contribution, we developed MathMLben, a benchmark dataset for measuring the quality of MathML markup that appears in a textual context. With this benchmark, we evaluated nine state-of-the-art LaTeX to MathML converters, including Mathematica as a major CAS. We published our results at the Joint Conference on Digital Libraries (JCDL) in 2018.

*"Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context"* by Moritz Schubotz, **André Greiner-Petter**, Philipp Scharpf, Norman Meuschke, Howard S. Cohl, and Bela Gipp. **In:** *Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries* (JCDL), 2018.

Chapter 2 — [18]

<sup>11</sup>*LaTeX to CAS Translator*.

We discovered that three of the nine tools were able to generate content MathML, but with insufficient accuracy. None of the available tools were capable of analyzing the context of a given formula. Hence, the converters were unable to conclude the correct semantic information for most of the symbols and functions. In our study, we proposed a manual semantification approach that semantically enriches the translation process of existing converters by feeding them semantic information from the surrounding context of a formula. The enrichment process was manually illustrated via the converter LaTeXML, which allowed us to add custom semantic macros to improve the generated MathML data. In fact, we used this manual approach to create the entries of MathMLben in the first place.

Naturally, our next goal was to automatically retrieve semantic information from the context of a given formula. Around this time, word embeddings [256] began to gain interest in the MathIR community [121, 215, 242, 400, 404]. It seemed that vector representations were able to capture some semantic properties of tokens in natural languages. Can we create such semantic vector representations of mathematical expressions too? Unfortunately, we discovered that the related work in this new area of interest did not discuss a crucial underlying issue with embedding mathematical expressions. In math expressions, certain symbols or entire groups of tokens are fixed, such as the symbol Γ and the parentheses in the Gamma function Γ(*x*) or the tokens *P*, *n*, and (*α,β*) in the Jacobi polynomial *P*<sub>*n*</sub><sup>(*α,β*)</sup>(*x*), while others, such as the arguments, may vary. Inspired by words in natural languages, we call these fixed tokens the stem of a mathematical object or operation. Unfortunately, in mathematics, this stem is context-dependent. If *π* is a function, its stem in *π*(*x* + *y*) consists of the symbol *π* together with the argument parentheses. However, if *π* is not a function, the stem in *π*(*x* + *y*) is just the symbol *π* itself. If we do not know the stem of a mathematical object, how can we group objects so that a trained model understands the connection between variations like Γ(*z*) and Γ(*x*)? The answer is: we cannot. The only alternative is to use context-independent representations, e.g., we only embed the identifiers or the entire expression. Each of these approaches has advantages and disadvantages. We shared our discussion with the community at the BIRNDL Workshop at the conference on Research and Development in Information Retrieval (SIGIR) in 2019.
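The two context-independent alternatives can be made concrete with a small sketch. The tokenizer below is illustrative (not the method of any cited embedding approach): the identifier-level view at least shares the `\Gamma` token between variants, while the expression-level view treats Γ(z) and Γ(x) as entirely unrelated tokens.

```python
# Two context-independent views of a formula (illustrative tokenizer,
# not the approach of any cited embedding work).
import re

def identifier_tokens(latex):
    """Leaf-level view: keep only macro names and single-letter identifiers."""
    return re.findall(r"\\[A-Za-z]+|[A-Za-z]", latex)

def expression_token(latex):
    """Root-level view: the entire expression is one opaque token."""
    return latex

a, b = r"\Gamma(z)", r"\Gamma(x)"
print(identifier_tokens(a))                                   # ['\\Gamma', 'z']
print(set(identifier_tokens(a)) & set(identifier_tokens(b)))  # shared stem token
print(expression_token(a) == expression_token(b))             # False: variants stay unrelated
```

Neither view captures the stem Γ(·) as a unit, which is exactly the grouping problem described above.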

*"Why Machines Cannot Learn Mathematics, Yet"* by **André Greiner-Petter**, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William I. Grosky, and Bela Gipp. **In:** *Proceedings of the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries* (BIRNDL@SIGIR), 2019.

Chapter 2 — [9]

Nonetheless, context-independent math embeddings still have many valuable applications. Search engines, for example, can profit from a vector representation that represents a mathematical expression in a particular context. Such a trained model would still be unable to tell us what the expression is, but it can tell us efficiently whether the expression is *semantically similar* (e.g., because the surrounding text is similar) to another expression. Further, embedding semantic LaTeX allows us to overcome the issue of unknown stems for most functions, since the macro unambiguously defines the stem. Youssef and Miller [404] trained such a model on the DLMF formulae. Later, we published an extended version of our workshop paper together with Youssef and Miller in the *Scientometrics* journal.

*"Math-Word Embedding in Math Search and Semantic Extraction"* by **André Greiner-Petter**, Abdou Youssef, Terry Ruas, Bruce R. Miller, Moritz Schubotz, Akiko Aizawa, and Bela Gipp. **In:** *Scientometrics* 125(3): 3017-3046, 2020.

Chapter 3 — [15]

Unfortunately, this sets us back to the beginning, where we need manually crafted semantic LaTeX. We started to investigate the issue of interpreting the semantics of mathematical expressions from a different perspective. As we will see later in Section 2.2.4, humans tend to visualize mathematical expressions in a tree structure, where operators, functions, or relations are parent nodes of their components. Identifiers and other terminal symbols are the leaves of these trees. The MathML tree data structure comes close to these so-called *expression trees* (see Section 2.2.4) but does not strictly follow the same idea [331]. The two aforementioned context-independent approaches to embed mathematical expressions take either the leaves or the roots of such trees. The subtrees in between are the context-dependent mathematical objects we need. Not all subtrees, however, are meaningful, and the mentioned expression trees are only theoretical interpretations. In searching for an approach to discover meaningful subexpressions, which we call Mathematical Objects of Interest (MOI), we performed the first large-scale study of mathematical notations on real-world scientific articles. In this study, we followed the assumption that every subexpression with at least one identifier can be semantically important. Hence, we split every formula into its MathML subtrees and analyzed their frequency in the corpora. Overall, we analyzed over 2.5 billion subexpressions in 300 million documents and showed that the frequency distribution of mathematical subexpressions is similar to that of words in natural language corpora. By applying known frequency-based ranking functions, such as BM25, we were also able to discover topic-relevant notations. We published these results at The Web Conference (WWW) in 2020.
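The core counting idea can be sketched in a few lines. The sketch below uses nested tuples as a stand-in for MathML subtrees and toy formulae of our own choosing; it illustrates only the principle of enumerating and counting identifier-bearing subexpressions, not the actual MOI pipeline.

```python
# Minimal sketch of the MOI frequency idea: enumerate all subexpressions
# (subtrees) that contain at least one identifier and count how often they
# occur across a corpus. Trees are nested tuples (head, *children) here for
# brevity; the study itself operated on MathML subtrees.
from collections import Counter

def subtrees(expr):
    """Yield every subtree of a nested-tuple expression tree."""
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)

def has_identifier(expr):
    """A subtree is a candidate MOI if it contains at least one identifier
    (heads at index 0 count as operators, not identifiers)."""
    if isinstance(expr, tuple):
        return any(has_identifier(c) for c in expr[1:])
    return isinstance(expr, str) and expr.isalpha()

# Two toy formulae: sin(x)^2 + cos(x)^2 and n * sin(x)
corpus = [
    ("+", ("^", ("sin", "x"), 2), ("^", ("cos", "x"), 2)),
    ("*", "n", ("sin", "x")),
]

counts = Counter(s for f in corpus for s in subtrees(f) if has_identifier(s))
print(counts[("sin", "x")])  # sin(x) occurs in both formulae -> 2
```

Ranking these counts with a frequency-based scheme such as BM25 then surfaces notations that are characteristic for a topic, as described above.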

*"Discovering Mathematical Objects of Interest — A Study of Mathematical Notations"* by **André Greiner-Petter**, Moritz Schubotz, Fabian Müller, Corinna Breitinger, Howard S. Cohl, Akiko Aizawa, and Bela Gipp. **In:** *Proceedings of the Web Conference* (WWW), 2020.

Chapter 3 — [14]

The applications that we derived from simply counting mathematical notations were surprisingly versatile. For example, with the large set of indexed math notations, we implemented the first type assistance system for math equations, developed a new faceted search engine for zbMATH, and enabled new approaches to measure potential plagiarism in equations. Besides these practical applications, it also gave us the confidence to continue focusing on subexpressions for our LaTeX semantification. Previous projects that aimed to semantically enrich mathematical expressions with information from the surrounding context primarily focused on one of the earlier mentioned extremes, i.e., the leaves or roots of expression trees [139, 214, 279, 329, 330]. Our study also revealed that the majority of unique mathematical formulae are neither single identifiers nor highly complex mathematical expressions. Hence, we concluded that we should focus on semantically enriching subexpressions (subtrees) rather than the roots or leaves. We proposed a novel context-sensitive translation approach based on semantically annotated MOI and shared our theoretical concept with the community at the International Conference on Mathematical Software (ICMS) in 2020.

*"Making Presentation Math Computable: Proposing a Context Sensitive Approach for Translating LaTeX to Computer Algebra Systems"* by **André Greiner-Petter**, Moritz Schubotz, Akiko Aizawa, and Bela Gipp. **In:** *Proceedings of the International Conference on Mathematical Software* (ICMS), 2020.

Chapter 3 — [10]

Afterward, we started to realize the proposed pipeline with a specific focus on Wikipedia. We focused on this encyclopedia for two reasons. First, Wikipedia is a free and community-driven encyclopedia and is, therefore, (a) less strict on writing styles and (b) more descriptive compared to scientific articles. Second, Wikipedia can actively benefit from our contribution, since additional semantic information about mathematical formulae can support users of all experience levels in reading and comprehending articles more efficiently [150]. Moreover, a successful translation from a formula in Wikipedia to a CAS makes the formula computable, which enables numerous additional applications. In theory, a mathematical article could, to some degree, be turned into an interactive document with our translations. However, the most valuable application of a translation of formulae in Wikipedia would be the ability to check equations for their plausibility. With the help of CAS, we are able to analyze whether an equation is semantically correct or suspicious. This evaluation would enable existing quality measures in Wikipedia to incorporate mathematical equations for the first time. The results from our novel context-sensitive translator, including the plausibility check algorithms, have been accepted for publication in the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) journal and are currently in press.

*"Do the Math: Making Mathematics in Wikipedia Computable."* **André Greiner-Petter**, Moritz Schubotz, Corinna Breitinger, Philipp Scharpf, Akiko Aizawa, and Bela Gipp. In press: *IEEE Transactions on Pattern Analysis and Machine Intelligence* (TPAMI), 2021.

Chapter 4 & 5 — [11]

Currently, we are also actively working on extending the backbone of Wikipedia itself to present additional semantic information about mathematical expressions when hovering over or clicking on a formula. This new feature helps Wikipedia users to better understand the meaning of mathematical formulae by providing details on the elements of a formula. Moreover, it paves the way towards an interface to actively interact with mathematical content in Wikipedia articles. We presented our progress and discussed our plans in the poster session at the JCDL in 2020.

*"Mathematical Formulae in Wikimedia Projects 2020."* Moritz Schubotz, **André Greiner-Petter**, Norman Meuschke, Olaf Teschke, and Bela Gipp. **In:** Poster Session at the *ACM/IEEE Joint Conference on Digital Libraries* (JCDL), 2020.

Chapter 6 — [17]

**Evaluating Digital Mathematical Libraries** Alongside this main research path, we continuously improved and extended LaCASt with new features and new supported CAS. Our first goal was to verify the translated, now computable, formulae in the DLMF. The primary motivation behind this approach was to quantitatively measure the accuracy of LaCASt translations. How can we verify whether a translation was correct? The well-established Bilingual Evaluation Understudy (BLEU) [282] measure in natural language translations is not directly applicable to mathematical languages because an expression may contain entirely different tokens but still be equivalent to the gold standard. Since the translation is computable, however, we can take advantage of the power of CAS to verify a translation. The basic idea is that a human-verified equation in one system must remain valid in the target system. If this is not the case, only three sources of errors are possible: either the source equation, the translation, or the CAS verification was incorrect. With the assumption that equations in the DLMF and major proprietary CAS are mostly error-free, we can translate equations from the DLMF to discover issues within LaCASt. First, we focused on symbolic verifications, i.e., we used the CAS to symbolically simplify the difference between the left- and right-hand side of an equation. If the simplified difference is 0, the CAS symbolically verified the equivalence of the left- and right-hand side and confirmed a correct translation via LaCASt. Additionally, we extended the verification approach to include more precise numeric evaluations. If a symbolic manipulation fails to return 0, it could also mean the CAS was unable to simplify the expression. To overcome this issue, we numerically calculate the difference on specific test values and check whether the difference is below a given threshold. If all test calculations are below the threshold, we consider the equation numerically verified. Even though this approach cannot verify equivalence, it is very effective in discovering disparity. We published the first paper with this new verification approach, based on Maple, at the CICM in 2018.
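The numeric half of this verification idea can be sketched in plain Python. The test values and threshold below are illustrative, not the actual DLMF evaluation settings; the symbolic step (simplifying lhs − rhs to 0) runs inside the CAS itself and is therefore represented only by the callables.

```python
# Sketch of numeric equation verification: evaluate |lhs(z) - rhs(z)| on a
# set of test values and accept only if every difference stays below a
# threshold. (Illustrative test values/threshold, not the thesis settings.)
import cmath

def numerically_verified(lhs, rhs, test_values, threshold=1e-8):
    """Return True if |lhs(z) - rhs(z)| < threshold for every test value."""
    return all(abs(lhs(z) - rhs(z)) < threshold for z in test_values)

# Example equation: Euler's formula e^{iz} = cos(z) + i*sin(z)
lhs = lambda z: cmath.exp(1j * z)
rhs = lambda z: cmath.cos(z) + 1j * cmath.sin(z)

test_values = [0.5, 1.0, 2.0, 1 + 1j]
print(numerically_verified(lhs, rhs, test_values))                      # -> True

# A wrong "equation" fails the check and is flagged as a disparity:
print(numerically_verified(lhs, lambda z: cmath.cos(z), test_values))   # -> False
```

As the text notes, a passing check cannot prove equivalence (the test values might accidentally agree), but a failing check reliably exposes a disparity in the source, the translation, or the CAS.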

*"Automated Symbolic and Numerical Testing of DLMF Formulae Using Computer Algebra Systems"* by Howard S. Cohl, **André Greiner-Petter**, and Moritz Schubotz. **In:** *Proceedings of the International Conference on Intelligent Computer Mathematics* (CICM), 2018.

Chapter 5 — [2]

The extension of the system and the new results led us to an extended journal version of the initial LaCASt publication [3]. This extended version mostly covered parts of my Master's thesis and is not reused in this thesis. For technical details about LaCASt, see the journal publication [13]. In Appendix D, available in the electronic supplementary material, we summarize all significant issues and reported bugs we discovered via LaCASt. The section also includes new issues that we discovered during the work on the journal publication. This journal version was published in the Aslib Journal of Information Management in 2019.

*"Semantic preserving bijective mappings for expressions involving special functions between computer algebra systems and document preparation systems"* by **André Greiner-Petter**, Howard S. Cohl, Moritz Schubotz, and Bela Gipp. **In:** *Aslib Journal of Information Management* 71(3): 415-439, 2019.

Appendix D — [13]

It turned out that LaCASt's translations of semantic LaTeX were so stable that we could use the same verification approach to specifically search for errors in the DLMF and issues in CAS. To maximize the number of supported DLMF formulae, we implemented additional heuristics in LaCASt, such as a logic to identify the end of a sum or to correctly interpret prime notations as derivatives. Additionally, we added support for translations to Mathematica and SymPy. We extended the support for Mathematica even further to perform the same verifications as in Maple also in Mathematica. The Mathematica support finally allows us to identify computational differences between two major proprietary CAS. Moreover, we extended the previously introduced symbolic and numeric evaluation pipeline with more sophisticated variable extraction algorithms, more comprehensive numeric test values, resolved substitutions, and improved constraint-awareness. All discovered issues are summarized in Appendix D, available in the electronic supplementary material. We further made all translations of the DLMF formulae publicly available, including the symbolic and numeric verification results. The results of this recent study have been published at the international conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).

*"Comparative Verification of the Digital Library of Mathematical Functions and Computer Algebra Systems"* by **André Greiner-Petter**, Howard S. Cohl, Abdou Youssef, Moritz Schubotz, Avi Trost, Rajen Dey, Akiko Aizawa, and Bela Gipp. **In:** *Tools and Algorithms for the Construction and Analysis of Systems (TACAS)*, 2022.

Chapter 5 — [8]

We also applied the same verification technique to the Wikipedia articles we mentioned earlier, which enabled LaCASt to symbolically and numerically verify even complex equations in Wikipedia articles. This evaluation is also part of the TPAMI submission.


Preprints of my publications are available at https://pub.agp-research.com

My Google Scholar profle is available at https://scholar.google.com/citations?user=Mq2B9ogAAAAJ

All translations of the DLMF formulae are available at https://lacast.wmflabs.org

A prototype of LACAST for Wikipedia is available at https://tpami.wmflabs.org

This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

*I don't know half of you half as well as I should like, and I like less than half of you half as well as you deserve.*

Bilbo Baggins - *The Lord of the Rings*

### **CHAPTER 2**

### **Mathematical Information Retrieval**



**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-40473-4\_2.

© The Author(s) 2023 A. Greiner-Petter, *Making Presentation Math Computable*, https://doi.org/10.1007/978-3-658-40473-4\_2

Making presentational math computable implies a transformation from one mathematical representation to another. In order to frame this task, we need to introduce presentational and computable formats and analyze available transformation tools between these formats. There is a large variety of different formats available to encode mathematical expressions, from visual formats, such as LaTeX [220] or MathML [60], to semantically enhanced encodings, such as content MathML [270], semantic LaTeX [260], sTeX [200], or OpenMath [19], and entire programming languages, such as CAS syntaxes [36, 128, 173, 175, 176, 177, 178, 393], theorem provers [37, 266, 287, 340, 354, 384], or mathematical packages in C++ [168], Python [252], or Java [79]. This chapter introduces what we understand as presentational and computable formats, provides an overview of math formats, and discusses existing transformation tools between these formats.

In particular, Section 2.1 introduces presentational and computable formats. Section 2.2 provides an extensive overview of mathematical formats, their attributes, and conversion approaches between them. Since there is a large variety of conversion tools and approaches available for many different formats [39, 200, 18, 351, 406], a translation from a presentational to a computable format can be achieved in many different ways. In this thesis, we mainly focus on translations from LaTeX to CAS syntaxes. The most well-studied translation path from LaTeX to CAS syntaxes would use content MathML as an intermediate, semantically enriched format. Hence, Section 2.3 analyzes state-of-the-art LaTeX to MathML converters. Section 2.4 underlines the research gap and paves the way for the rest of the thesis by briefly discussing MathIR approaches for conversions from presentational to computable formats. Section 2.3 has been published at the JCDL [18]. The introduction of math embeddings in Section 2.2 was published as a workshop paper at the SIGIR conference [9] and later reused in an extended article for the Scientometrics journal [15].

### **2.1 Background and Overview**

Computable encodings are interpretable formal languages in which keywords or sequences of tokens are associated with specific implemented definitions, which allows certain mathematical actions to be performed on these elements, such as evaluating numeric values or symbolically manipulating the elements. Computable encodings, therefore, must be semantically unambiguous. Otherwise, an interpreter is unable to associate a sequence of tokens with a unique underlying definition. This ambiguity problem is mainly solved by interpreters in two ways: either the system automatically performs disambiguation steps following a decision tree with a fixed set of internal rules, such as for x^y^z in Mathematica, or the system refuses to parse the expression and returns an error, such as for x^y^z in Maple.

> **Computable formats** are formal languages that link keywords or phrases with unique implemented definitions. Computable expressions are semantically unambiguous.

Presentational formats, on the other hand, focus on controlling the visualization of mathematical formulae. They generally allow users to change spaces between tokens (e.g., \, and \; in LaTeX), support two-dimensional visualizations (e.g., $\int_a^b \frac{dx}{x}$), or render entire graphs and images. However, pure presentational formats (in contrast to semantically enhanced encodings) do not specify the meaning of an expression. Consequently, mathematical expressions in presentational formats are generally semantically ambiguous, and it is the author's responsibility to disambiguate the meaning of the expression by providing additional information in the context. Digital presentational formats, such as LaTeX, are also interpretable formal languages<sup>1</sup>. In contrast to computable formats, presentational languages link tokens with specific visualizations rather than executable subroutines. Hence, expressions in these formats must be *unambiguous* too. Otherwise, interpreters are unable to link an expression with a unique visualization (see x^y^z in LaTeX). The difference to computable encodings is that expressions in presentational formats must be visually but not semantically unambiguous. For instance, LaTeX refuses to parse x^y^z because the renderings of {x^y}^z (i.e., ${x^y}^z$) and x^{y^z} (i.e., $x^{y^z}$) differ. In contrast, Maple rejects x^y^z because there is a mathematical (and, in consequence, a computational) difference between $(x^y)^z$ and $x^{(y^z)}$.
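The x^y^z ambiguity is easy to reproduce in any programming language with an exponentiation operator. Python, for instance, resolves it the same way Mathematica's internal rule does, by parsing `**` right-associatively:

```python
# Python parses ** right-associatively, so x**y**z means x**(y**z);
# the alternative grouping (x**y)**z is a mathematically different value.
x, y, z = 2, 3, 2

right = x ** y ** z    # parsed as x ** (y ** z) = 2 ** 9
left = (x ** y) ** z   # explicitly grouped: (2 ** 3) ** 2 = 8 ** 2

print(right)  # -> 512
print(left)   # -> 64
```

The two groupings produce different values, which is exactly why Maple refuses the unparenthesized form while Mathematica and LaTeX-rendering engines silently pick one.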

> **Presentational formats** are formal languages with a focus on visualization. Presentational expressions can be semantically but not visually ambiguous.

In this thesis, we focus on LaTeX as the presentational format and CAS syntaxes as the computable formats. We choose LaTeX because it is currently the de-facto standard for writing scientific papers in the STEM disciplines [129, 402]. Several other word processors, such as the article editor in Wikipedia<sup>2</sup> or Microsoft Word [248], entirely or partially support LaTeX inputs. In addition, LaTeX is the main presentational format that is entered by hand. In contrast, MathML, due to its XML data structure, is not a user-friendly<sup>3</sup> encoding and is mostly automatically generated from other formats [82, 159, 18, 374]. Image formats are the result of pictures, scans, or handwritten inputs and are, therefore, less machine-readable. As a consequence, image formats of mathematical formulae are mainly converted into LaTeX or MathML in a pre-processing step [27, 39, 267, 378, 379, 406, 411]. We choose CAS syntaxes as our target computable format because CAS generally support a large variety of different use cases, from manipulations and visualizations to computations and simulations [81, 413]. Especially general-purpose CAS, such as Maple [36] and Mathematica [393], address a broad range of topics [128, 392]. In contrast, theorem provers, proof assistants, and similar software, as potential other computable formats, solely focus on automated reasoning [147, 266, 354, 384]. Hence, the computation of mathematical formulae plays a less significant role in such software.

### **2.2 Mathematical Formats and Their Conversions**

Figure 2.1 provides an overview of different math encodings and existing conversion approaches between them. In addition to the figure, Table 2.1 provides quick access to references for specific translation directions. Figure 2.1 organizes formats by their level of semantics and their level of machine readability. This categorization is meant to be neither perfectly accurate nor complete. Instead, the figure aims to provide a rough visualization of the most common encodings and their differences. For instance, there is no notable technical difference between


<sup>1</sup> Note that this interpretation of presentational formats does not include images. Since images are less machine-readable formats, they are generally first converted into interpretable formats, such as LaTeX. This conversion process is very challenging on its own [406, 411]. Hence, including images in our task would not provide any benefits but would make it unnecessarily more complicated.

<sup>2</sup> https://en.wikipedia.org/wiki/Help:Displaying\_a\_formula [accessed 2021-10-01]

<sup>3</sup> A little histrionically described as '*Making humans edit XML is sadistic!*' from the Django 1.7.11 documentation [118].

Figure 2.1: Reference map of mathematical formats and translations between them. The red path illustrates the main subject of this thesis. In Section 2.3, we focus specifically on existing translation approaches from LaTeX to MathML (orange arrows) to evaluate an alternative to the red translation path.

the levels of semantics in content MathML and OpenMath (see the paragraph about OpenMath in Section 2.2.1). Nonetheless, OpenMath defines the content dictionaries that content MathML uses to semantically annotate symbols beyond school mathematics. Hence, one could argue that content MathML encodes less semantic information without the help of OpenMath and, therefore, should be positioned more to the left. Another disparity can be found in the level of machine readability between CAS syntaxes and theorem prover formats. Since both are programming languages, any CAS or theorem prover expression requires a very specific (often proprietary) parser. Thus, a programming language is arguably never more *machine readable* than any other programming language. Nonetheless, most CAS prefer a more intuitive input format (sometimes even 2D input) similar to LaTeX over a machine-readable syntax [88, 128, 179] to improve their user experience. Because of these more user-friendly input formats, we positioned CAS syntaxes below theorem prover formats. Note also that math embeddings, i.e., vector representations of math tokens, are not in Figure 2.1 because the level of semantics these vectors capture is still unclear and an open research question (see Section 2.2.5). The red path in Figure 2.1 shows the new translation path that we focus on in this thesis. Dotted arrows represent translation paths that generally do not require context analysis and are, therefore, of less interest for the subject of this thesis. The orange and red arrows (and highlighted cells in Table 2.1) refer to the contributions of this thesis. The red arrows refer to the main research contribution explained in Chapters 3 and 4.

### **2.2.1 Web Formats**

Web formats are designed to display mathematical formulae and knowledge on the web. Consequently, those formats prioritize machine readability over user experience. Hence, a variety of different translation approaches to, from, or between web formats exists. Since mathematics on the web is generally embedded in HTML code, most web formats use an XML structure. Thus, web formats are often described as verbose and are rarely edited or created by hand. On the other hand, the XML structure simplifies the inter-connectivity between web formats, e.g., via XSL Transformations (XSLT) [362]. There are three main formats used on the web: the current web standard MathML, the purely semantic encoding OpenMath, and the semantic document encoding OMDoc. Note that many websites still use image formats to display math. We will discuss image formats in Section 2.2.4.

Table 2.1: Overview of available mathematical format translations. The highlighted conversion fields refer to contributions made in this thesis. The columns and rows refer to: 'pMML' for presentation MathML, 'cMML' for content MathML, 'sem.LaTeX' for semantic LaTeX, 'Theo. Prov.' for theorem provers or proof assistants, 'Img' for images, and 'Speech' for spoken (audio) mathematical content. The group 'Comp.' refers to computable formats. In some cases, no transformation is necessary, e.g., from OMDoc to OpenMath because OMDoc uses OpenMath internally. In such cases, we simply refer to the overview publication of the format, here [198] for OMDoc.

### **2.2.1.1 MathML**

For the web, the Mathematical Markup Language (MathML) [60] is the current official recommendation of the World Wide Web Consortium (W3C) and has even been an official standard for HTML5 since 2015 [169]. MathML is defined via two different markups: the *presentation*<sup>4</sup> and

<sup>4</sup> https://www.w3.org/TR/MathML3/chapter3.html [accessed 2021-10-01]

the *content*<sup>5</sup> markup. MathML containing only presentation markup elements is called presentation MathML; analogously, MathML containing only content markup elements is called content MathML. Both markups can be used together side by side for a single expression in so-called parallel markup [202, 259, 270]. If elements in the presentation markup are linked back and forth with elements from the content markup, the encoding is also called cross-referenced MathML.

Content MathML, in contrast to presentation MathML, aims to encode the meaning, i.e., the semantics, of mathematical expressions. Content MathML addresses the issues of ambiguous presentational encodings by providing a standard representation of the content of mathematics. The encoding comes with a large number of predefined functions, e.g., for sin and log, intended to cover most of K-14<sup>6</sup> mathematics. For formulae beyond school mathematics, content MathML uses so-called Content Dictionaries (CDs) [204] (see the OpenMath paragraph for more details about CDs). Listing 2.1 shows presentation and content MathML encodings for the Legendre polynomial $P_n(x)$. Note that the presentation MathML encoding contains an operator (<mo> for *mathematical operator*) between $P_n$ and $(x)$ which contains the invisible character *function application* (Unicode character U+2061). Nowadays, content MathML is often used in digital libraries to improve the performance of math search engines with accessible semantic information [345, 347, 348, 381].
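The invisible *function application* operator can be inspected programmatically. The following sketch uses Python's standard `xml.etree` on a hand-written presentation MathML snippet for $P_n(x)$; the snippet is illustrative and not taken verbatim from Listing 2.1.

```python
import xml.etree.ElementTree as ET

# Hand-written presentation MathML for the Legendre polynomial P_n(x).
# The <mo> between P_n and (x) carries U+2061 (function application).
pmml = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mrow>'
    '<msub><mi>P</mi><mi>n</mi></msub>'
    '<mo>&#x2061;</mo>'
    '<mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow>'
    '</mrow></math>'
)

NS = '{http://www.w3.org/1998/Math/MathML}'
root = ET.fromstring(pmml)
operators = [mo.text for mo in root.iter(NS + 'mo')]

# The first operator is the invisible function-application character.
print(operators[0] == '\u2061')  # True
```

This also illustrates why presentation MathML counts as machine-readable: the operator structure is explicit in the tree, even when it is invisible in the rendering.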

Since MathML is the web standard, there are numerous tools available that convert other encodings from and to MathML. The most common conversions include translations between presentation and content MathML [139, 270, 364], from [159, 257, 267, 335, 374] and to<sup>7</sup> LaTeX, OpenMath [59, 342, 343], CAS [318], PDF [27, 267], images [406], and audio encodings (mainly in the math-to-speech research field) [67, 349, 387]. The W3C officially lists 42 converters and other software tools that generate MathML on their wiki<sup>8</sup>. In addition, the official *interoperability report*<sup>9</sup> of MathML provides a comprehensive overview of software that supports MathML and shows official statements from implementors. Due to its XML format, most conversion tools use XSLT [362] to transform MathML into either other XML encodings or string representations [59, 61]. This translation approach can be described as rule-based because, in XSLT, we define a set of transformation rules for XML subtrees.

Most of the converters to MathML do not support content MathML. Translations from presentational formats to content MathML face a wide range of ambiguity issues [159, 259, 374]. For example, the <mo> element in Listing 2.1 regularly contains the *invisible times* symbol (Unicode character U+2062) rather than *function application* because most conversion tools do not interpret $P_n$ as a function. For content MathML, even more disambiguation steps are required to correctly link $P$ with the Legendre polynomial. For such disambiguation, a combination of semantification and XSLT rules is used to perform translations to content MathML [139, 270, 364]. Nghiem et al. [270] propose a machine translation approach to generate content MathML from presentation MathML but do not consider textual descriptions from the surrounding context of a formula. Likewise, Toloaca and Kohlhase [364] use patterns of notation definitions

<sup>5</sup> https://www.w3.org/TR/MathML3/chapter4.html [accessed 2021-10-01]

<sup>6</sup> Kindergarten to early college.

<sup>7</sup> Two well-known projects for translations from MathML to LaTeX use XSL transformations: web-xslt https://github.com/davidcarlisle/web-xslt/tree/main/pmml2tex and mml2tex https://github.com/transpect/mml2tex [accessed 2021-10-01].

<sup>8</sup> https://www.w3.org/wiki/Math\_Tools [accessed 2021-10-01]

<sup>9</sup> https://www.w3.org/Math/iandi/mml3-impl-interop20090520.html [accessed 2021-10-01]

to find a content MathML expression that matches the presentation MathML parse tree. Grigore et al. [139], on the other hand, first generate a local context of the five nouns preceding the expression to deduce symbol declarations from OpenMath CDs. Besides Grigore et al. [139], other existing approaches for translations to content MathML only consider the semantics within the given formula itself or in formulae in the same document [159, 259, 374] but ignore the textual context surrounding a formula. For example, these tools follow the assumption that a $P$ with a subscript followed by an expression in parentheses should be interpreted as the Legendre polynomial. However, many expressions cannot be disambiguated without considering the textual context, such as the $\pi(x + y)$ example from the introduction.
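A context-sensitive step can be sketched as a simple keyword heuristic: scan the text surrounding a formula for cues and pick an interpretation accordingly. The cue table and function below are hypothetical illustrations for exposition, not the method of any cited tool.

```python
# Hypothetical cue table mapping textual context phrases to interpretations
# of ambiguous leading symbols in expressions such as \pi(x+y) or P_n(x).
CUES = {
    'prime counting':      'prime counting function',
    'legendre polynomial': 'Legendre polynomial',
    'ratio of a circle':   'mathematical constant pi',
}

def disambiguate(context: str) -> str:
    """Return the first interpretation whose cue occurs in the context."""
    text = context.lower()
    for cue, interpretation in CUES.items():
        if cue in text:
            return interpretation
    return 'unknown (no cue found)'

print(disambiguate('Let pi(x) denote the prime counting function.'))
# prime counting function
```

Real context analysis is, of course, far harder than keyword spotting, but the sketch shows why the surrounding text, not the formula alone, carries the decisive information.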

Most CAS support MathML either directly or via external software packages [318, 343]. However, to the best of our knowledge, no CAS currently considers the CDs in content MathML correctly. Hence, the import and export functions in CAS are generally limited to school mathematics. It should be noted that CDs are considered by CAS, but only in OpenMath, e.g., via the transport protocol *Symbolic Computation Software Composability Protocol* (SCSCP) [361]. Since this protocol was developed to enable inter-CAS communication, we explain this project in more detail in Section 2.2.3.

In summary, a reliable generation of content MathML requires a semantically enhanced source formula, e.g., in CAS syntaxes [318, 343], theorem prover formats [152], or OpenMath [59, 342]. Otherwise, translations tend to generate inaccurate MathML. In Section 2.3, we will examine existing LaTeX to MathML converters in more detail to investigate the practicality of using MathML as an intermediate format for translations from LaTeX to CAS encodings.

### **2.2.1.2 OpenMath**

The OpenMath Society (originally the OpenMath Consortium [19]) defines another standard encoding called OpenMath [53]. The OpenMath standard aims to focus exclusively on the semantics of mathematics and, therefore, goes a step further than MathML [204], which aims to cover both the presentation and the content information in a single format. Originally, OpenMath was invented during a series of workshops starting in 1993, mainly by researchers from the computer algebra community, to easily exchange mathematical expressions between CAS and other systems [19, 89]. MathML, originally developed with the same goal, was first released in 1998<sup>10</sup>. Both formats are very similar to each other [204], and one may question the purpose of two different formats for more or less the same tasks [82, 114]. Discussions about the necessity of both formats arise from time to time, even decades later [25, 204]. However, OpenMath and MathML have been and are still developed alongside each other rather than competing with one another, due to a large overlap of people working on both formats [204]. To summarize the coexistence today: MathML provides rendered visualizations for OpenMath, while the Content Dictionaries (CDs) from OpenMath add semantics to MathML<sup>11</sup>.

The OpenMath Society maintains a set of standard CDs. A CD is a set of declarations (i.e., definitions, notations, constraints, etc.) for mathematical symbols, functions, operators, and other mathematical concepts. The idea behind the publicly maintained CDs by the OpenMath

<sup>10</sup>https://www.w3.org/TR/1998/REC-xml-19980210 [accessed 2021-10-01]

<sup>11</sup>A more detailed discussion about the history of both formats can be found at https://openmath.org/projects/esprit/final/node6.htm, https://openmath.org/om-mml/ [both accessed 2021-10-01], and [198, pp. 5].

Listing 2.1: The Legendre polynomial in two MathML encodings and in OpenMath.

Society is to provide a ground truth for math declarations so that the used symbols become interchangeable among different parties. However, everybody can create new custom CDs, which might be integrated into the existing standard set maintained by the OpenMath Society [90]. M. Schubotz [327], for example, proposed a concept for a CD that builds on the knowledge base Wikidata. More recently, B. Miller [258] created a content dictionary specifically for the functions in the DLMF.

Listing 2.1 compares both MathML markups with OpenMath. While the tree structures of content MathML and OpenMath cannot be directly compared with mathematical expression trees [331] (see also Section 2.2.4), the XML tree structure of both formats is unambiguous. Both formats rely on the CD entry of the Legendre polynomial in orthpoly1<sup>12</sup>. Since this CD is from OpenMath, the OpenMath encoding does not require the entire URL. The CD entry further specifies that the Legendre polynomial has two arguments. Hence, the following two siblings in the tree structure are considered to be the arguments. OpenMath specifically annotates them as OMV (for variable objects). As an alternative to the orthpoly1 CD by OpenMath, one can also use Schubotz's [327] Wikidata CD to annotate $P$ with the Wikidata item Q215405 or Miller's [258] DLMF CD to link $P$ to §18.3 of the DLMF [98, (18.3)].

As previously mentioned, both formats (content MathML and OpenMath) are rather similar to each other [56, 343]. Hence, there are several ways to transform mathematical expressions between them [343], e.g., via XSLT [59, 342]. This transformation is possible without information retrieval techniques since both formats encode the same level of semantic information via CDs. Even though the primary goal of OpenMath was to provide a format that allows communication between mathematical software [19], most CAS do not support OpenMath directly. Instead, an independent project of research institutions funded by the European Union was launched to improve the *symbolic computation infrastructure in Europe*. The main

<sup>12</sup>https://openmath.org/cd/orthpoly1.html#legendreP [accessed 2021-10-01]

result of this project was the SCSCP protocol for inter-CAS communication via OpenMath. We will discuss the SCSCP protocol and the project in more detail in Section 2.2.3. Several CAS, including Maple [243] and Mathematica [44], implemented endpoints for the SCSCP protocol. Hence, via this protocol, CAS support OpenMath to some degree. Apart from the protocol solution, there are some research projects that use OpenMath as an interface to and between CAS and theorem prover formats [57, 152, 303, 338, 343].

### **2.2.1.3 OMDoc**

Sometimes, it might be worthwhile to explicitly annotate the context of mathematical expressions with additional information. For example, an equation might be part of a theorem that has not been proven yet. Hence, that particular equation and its context should not be confused with a definition. Since this meta-information about mathematical expressions is organized on a document level, Kohlhase [198, 199] introduced another format, the Open Mathematical Document (OMDoc), to semantically describe entire mathematical documents. While formats like OpenMath or MathML encode the semantics of single expressions, which Kohlhase describes as the *microscopic* level, OMDoc aims for the *macroscopic*, i.e., the document level. This format can be especially useful for interactive documents [80, 85, 131, 150, 162, 201] and theorem provers [38, 146, 163, 340], which generally rely more on the meta-information from the document level. Single math expressions in OMDoc are still encoded as OpenMath for the semantics and MathML for the visualization. This thesis, in turn, focuses more on the formats that directly encode mathematical expressions rather than a *macroscopic*-level encoding. Nonetheless, it should be noted that a translation to a CAS might differ depending on the scope of an equation, e.g., an equation symbol in a definition differs from an equation symbol in an example. Heras et al. [152], for example, used OMDoc to interface CAS and theorem provers. Hence, the OMDoc format might be worth supporting once the translation reaches a level of reliability and comprehensiveness at which the semantics on the document level matter (see the future work in Section 6.3).

### **2.2.2 Word Processor Formats**

The previously explained formats are beneficial for web applications and for exchanging mathematical knowledge between systems. However, their underlying verbose XML data structure makes manual maintenance too cumbersome. Consequently, MathML and OpenMath expressions beyond a certain size are almost always computer-generated. The actual source of the data, i.e., what a human manually typed, uses a different format, such as LaTeX, visual template editors, or image formats. In the following, we introduce formats and methods used to manually type mathematics in word processors.

### **2.2.2.1 LaTeX**

LaTeX is currently the de-facto standard for writing scientific papers in the STEM disciplines [129, 220, 402] and has even been described as *the lingua franca of the scientific world* [220]. Numerous other word processors entirely or partially support LaTeX input. LaTeX was developed by Leslie Lamport and extended the TeX system with valuable macros that make working with TeX easier [220]. TeX was developed by Donald E. Knuth [189, p. 559] in 1977. Knuth was dissatisfied with the typography of his book *The Art of Computer Programming* [189, pp. 5, 6, and 24] and created TeX to overcome the hurdles of consistently and reliably typesetting mathematical formulae for printing. Today, there is no significant difference between LaTeX and TeX in terms of mathematical expressions. Hence, we continue using LaTeX as the modern successor and refer to TeX only to underline technical differences or to describe the underlying base for other TeX-like encodings. LaTeX provides an intuitive syntax for mathematics that is similar to the way a person would write math by hand, e.g., using the underscore to set a sequence of tokens in subscript.

LaTeX is an interpretable language that requires a parser. In theory, the flexibility of LaTeX (and especially of the underlying TeX implementation) makes parsing LaTeX very challenging [187]. For example, TeX allows every literal to be redefined at runtime, making TeX (and therefore LaTeX) a context-sensitive formal language. In practice, however, most LaTeX literals are not redefined. Instead, it is common to extend LaTeX with additional commands rather than redefining existing logic. Especially for mathematical expressions, several projects simply presume that LaTeX is parsable with a context-free grammar, which makes parsing mathematical expressions in LaTeX considerably simpler [71, 402].
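Under this context-free assumption, tokenizing LaTeX math reduces to a handful of regular expressions. The following is a deliberately minimal sketch; real systems such as the POM tagger are far more elaborate and attach semantic tags to each token.

```python
import re

# A minimal tokenizer for LaTeX math, assuming (as several projects do)
# that no macros are redefined, so a context-free scan suffices.
TOKEN = re.compile(r"""
      \\[a-zA-Z]+     # control word, e.g. \sin, \alpha
    | [{}^_()]        # grouping, script, and parenthesis tokens
    | [a-zA-Z]        # single-letter identifier
    | \d+             # number
    | \S              # any other symbol, e.g. + or =
""", re.VERBOSE)

def tokenize(tex: str):
    """Split a LaTeX math string into a flat list of tokens."""
    return TOKEN.findall(tex)

print(tokenize(r"P_n^{(\alpha,\beta)}(x)"))
# ['P', '_', 'n', '^', '{', '(', '\\alpha', ',', '\\beta', ')', '}', '(', 'x', ')']
```

Note that this sketch would silently mis-handle redefined literals, which is exactly the context sensitivity the simplifying assumption trades away.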

Since LaTeX is the standard for typesetting mathematics, numerous translation tools to the web standard MathML are available [133, 135, 159, 257, 267, 335, 374] (see also the MathML explanation in Section 2.2.1). In Section 2.3, we will focus more closely on translations between LaTeX and MathML. LaTeX is also a standard target encoding for Optical Character Recognition (OCR) techniques [406, 411], which retrieve mathematical expressions from images or PDF files (see Section 2.2.4). LaTeX focuses solely on the representation of math (similar to presentation MathML). Additionally, recent studies explore the capabilities of trained vector representations of LaTeX expressions [121, 15, 215, 360, 400, 404] for new similarity measures and search engines [404], classification approaches [404], and even the automatic generation of new LaTeX expressions [400]. Nonetheless, the effectiveness of capturing semantic information with these methods is controversial [9].

**LaTeX to CAS converters** Most relevant for our task are existing translation approaches directly from LaTeX to CAS syntaxes. These translators can be categorized into two groups: (1) CAS-internal import functions and (2) external programs for specific or multiple CAS. Mathematica [391] and SymPy [357] are two CAS with the ability to import LaTeX expressions directly. SymPy's import function was ported from the external latex2sympy<sup>13</sup> project. Examples of external tools are SnuggleTeX [251] and our in-house translator LaCASt [3, 13]. SnuggleTeX is a LaTeX to MathML converter with an experimental feature for performing translations to the CAS Maxima [324]. LaCASt is the predecessor project of this thesis and focused on translating semantic LaTeX from the DLMF to the CAS Maple.

All of these converters are rule-based, i.e., they perform translations using hard-coded, predefined conversion rules. SnuggleTeX supports translations to Maxima since version 1.1.0 [251]. The tool allows users to manually predefine translation rules, such as interpreting $e$ as the mathematical constant, $\Gamma$ as the Gamma function, or $f$ as a general function. SnuggleTeX is no longer actively maintained and mostly fails to translate general expressions. The developers themselves declare the translation to Maxima as experimental and limited<sup>14</sup>. SymPy, in contrast,

<sup>13</sup>The project is therefore no longer actively developed but still available on GitHub: https://github.com/augustt198/latex2sympy [accessed 2021-10-01]

<sup>14</sup>https://www2.ph.ed.ac.uk/snuggletex/documentation/semantic-enrichment.html [accessed 2021-10-01]

is actively maintained and provides a more sophisticated import function for LaTeX expressions. SymPy's import function parses a given LaTeX expression via ANTLR<sup>15</sup> and traverses the parse tree to convert each token (and subtree) into the SymPy syntax. SymPy uses a set of heuristics that mostly cover standard notations, including \sin. Additionally, it uses pattern matching approaches to identify typical mathematical concepts, such as the derivative notation in $\frac{d}{dx}\sin(x)$. Similarly, LaCASt first parses the input expression with the Part-of-Math (POM) tagger [402] and performs translations by traversing the parse tree. The POM tagger tags tokens with additional information from external lexicon files. LaCASt manipulates these lexicon files to tag tokens with their appropriate translation patterns. LaCASt takes the translation patterns attached to a single token and fills them with the following and preceding nodes in the parse tree to perform a translation. Within this thesis, we will extend LaCASt further with pattern matching techniques and human-inspired heuristics to translate more general formulae, including the derivative notation example, sums, products, and other operators. A more detailed discussion about the first version of LaCASt is available in [13].
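The pattern-filling idea can be sketched in a few lines: each supported macro carries a translation pattern with numbered slots, which are filled with the arguments found next to the token in the parse tree. The pattern syntax and rule table below are illustrative inventions, not the actual lexicon format of any of the tools discussed here.

```python
# Illustrative rule table: translation patterns with numbered argument
# slots ($0, $1, ...). Invented syntax for exposition only.
PATTERNS = {
    '\\sin':         'sin($0)',
    '\\BesselJ':     'BesselJ($0, $1)',
    '\\JacobipolyP': 'JacobiP($0, $1, $2, $3)',
}

def translate(macro: str, args: list) -> str:
    """Fill the macro's translation pattern with its parse-tree arguments."""
    result = PATTERNS[macro]
    for i, arg in enumerate(args):
        # Simple substitution; a real implementation would need proper
        # placeholder parsing (e.g., $1 vs. $10) and error handling.
        result = result.replace('$' + str(i), arg)
    return result

print(translate('\\JacobipolyP', ['n', 'alpha', 'beta', 'x']))
# JacobiP(n, alpha, beta, x)
```

The hard part, of course, is not the fill step but deciding which pattern applies to an ambiguous token in the first place, which is where context analysis enters.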

While SymPy and SnuggleTeX are open source and allow interested readers to analyze internal implementation details, we can only speculate about the solutions in proprietary software, such as Mathematica. As we saw in Table 1.2 (and will see later in Chapter 4), Mathematica seems to follow a pattern recognition approach to link known notations, such as $P_n^{(\alpha,\beta)}(x)$, to their internal counterparts, such as JacobiP[n, \[Alpha], \[Beta], x]. Since neither Mathematica nor any other CAS or converter mentioned analyzes the textual context of a formula, importing ambiguous notations generally fails. Since the internal logic (and therefore the underlying patterns) is hidden, it is difficult to estimate the accuracy and power of Mathematica's LaTeX import function. As an alternative to Mathematica itself, one can use WolframAlpha<sup>16</sup> [309]. WolframAlpha is described as a knowledge or answer engine. Technically, WolframAlpha is a web interface that uses Mathematica as the backbone for computations. WolframAlpha performs numerous pre-processing and interpretation steps that allow users to generate scientific information without entering specific Mathematica syntax [64, 383].

Table 2.2 compares the converters on our introduction examples (see Table 1.2). The table also contains the first version of LaCASt (published in 2017 [3]) for comparison. We observe that WolframAlpha clearly performs best on these simple, general inputs. The reason is that WolframAlpha targets a broad, less scientific audience, which allows the system to make several assumptions. On more topic-specific inputs, such as $P_n^{(\alpha,\beta)}(\cos(a\Theta))$, it fails. This is further underlined by the fact that Mathematica itself has no trouble interpreting $P_n^{(\alpha,\beta)}(\cos(a\Theta))$. This indicates that both systems are optimized for their expected user groups. On these simple cases, SymPy also performs better than Mathematica. However, SymPy's size and support of special functions are not comparable with Mathematica's; SymPy therefore falls behind Mathematica on a more scientific dataset, such as the DLMF.

A more sophisticated evaluation on 100 randomly selected DLMF formulae revealed that Mathematica can be considered the current state of the art for translating LaTeX to CAS. Nonetheless, it translated only 11 cases correctly, compared to 7 successful translations by SymPy and 22 by LaCASt. The full benchmark is available in Table E.1 in Appendix E.1, available in the electronic supplementary material.

<sup>15</sup>ANother Tool for Language Recognition (ANTLR): https://www.antlr.org/index.html [accessed 2021-10-01]

<sup>16</sup>Often stylized as Wolfram|Alpha



Since LaTeX can easily be extended with new content via macros, some projects try to semantically enhance LaTeX with unambiguous commands. The two most comprehensive projects are semantic LaTeX and sTeX.

### **2.2.2.2 Semantic/Content LaTeX**

```
1 P_n^{(\alpha,\beta)}(x)                 % Generic LaTeX
2 \JacobipolyP{n}{\alpha}{\beta}@{x}      % Semantic LaTeX
```
Listing 2.2: The Jacobi polynomial in LaTeX (line 1) and semantic LaTeX (line 2).

Semantic LaTeX (also known as content LaTeX) was developed by Bruce Miller [260] at the National Institute of Standards and Technology (NIST) to semantically enhance the equations in the DLMF [403]. Essentially, semantic LaTeX is a set of custom LaTeX macros that are linked to unique definitions in the DLMF. Consider, for example, the Jacobi polynomial in Listing 2.2. The generic LaTeX expression does not contain any information linking it to the Jacobi polynomial. Semantic LaTeX, however, replaces the generic expression with a new macro \JacobipolyP which is linked to the DLMF [98, (18.3#T1.t1.r2)]<sup>17</sup>. In addition, all variable arguments (parameters and variables) are separated and ordered following the function command. This separation is essential for disambiguating notations. For example, the sine function is sometimes written without parentheses, such as $\sin x$, resulting in ambiguous notations, such as $\sin x + y$. The semantic LaTeX macros allow this expression to be visualized as before but encoded unambiguously via \sin@@{x+y} (which is rendered as $\sin x+y$). Originally, semantic LaTeX helped to develop a reliable search engine for the DLMF [260]. Nowadays, the macros are also used in other projects and have even been extended for the Digital Repository of Mathematical Formulae (DRMF) [77, 78], an outgrowth of the DLMF.

<sup>17</sup>Hereafter, we refer to specific equations in the DLMF by their labels. The label can be appended to the base URL of the DLMF. For example, the sine function is defined at 4.14.E1, which can be reached via https://dlmf.nist.gov/4.14.E1 [accessed 2021-10-01].

Semantic LaTeX will play a crucial role in the rest of this thesis because it allows us to stick with the easily maintainable syntax of LaTeX while semantically elevating the information of math expressions to a level that can be exploited for translations to CAS [3, 8, 13]. The main reason is that the semantic LaTeX macros mostly cover OPSF from the DLMF. OPSF are a set of functions and polynomials which are generally considered important, such as the trigonometric functions (also categorized as elementary functions), the Beta function, or orthogonal polynomials. Most OPSF have more or less well-established names and standard notations. The DLMF (i.e., especially the original book [276]) is considered a standard reference for OPSF [381]. General-purpose CAS, such as Mathematica and Maple, also focus on comprehensive support of OPSF [381]. Hence, semantic LaTeX macros play a crucial role for translations from LaTeX to CAS syntaxes. Since CAS syntaxes are programming languages, CAS can be extended with new code. However, translating new math formulae to CAS can become arbitrarily complex. Suppose the prime counting function were not supported by Mathematica. In this case, $\pi(x + y)$ could not be translated to a simple mathematical formula in the syntax of Mathematica but would require entirely new subroutines. Therefore, a comprehensive, viable, and reliable translator from LaTeX to the syntax of a CAS should maximize its support for OPSF in order to be useful.
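The fixed argument order of semantic macros makes them easy to parse mechanically. The following sketch splits a simplified, non-nested semantic macro call into its name, parameters (before the @), the number of @ signs, and variables (after the @); the grammar here is a loose approximation for illustration, not the actual macro definition format.

```python
import re

# Simplified grammar: \Name{param}...@+{variable}...  (no nested braces).
MACRO = re.compile(r'\\([A-Za-z]+)((?:\{[^{}]*\})*)(@+)((?:\{[^{}]*\})*)')

def parse_semantic_macro(tex: str):
    """Return (name, parameters, number of @ signs, variables)."""
    m = MACRO.fullmatch(tex)
    groups = lambda s: re.findall(r'\{([^{}]*)\}', s)
    return m.group(1), groups(m.group(2)), len(m.group(3)), groups(m.group(4))

name, params, at_count, variables = parse_semantic_macro(
    r'\JacobipolyP{n}{\alpha}{\beta}@{x}')
print(name, params, at_count, variables)
# JacobipolyP ['n', '\\alpha', '\\beta'] 1 ['x']
```

Because parameters and variables are cleanly separated, such a parse directly yields the argument list needed for a rule-based translation to a CAS call.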

Definition 2.1 provides a brief definition of the elements of a semantic macro. While the semantic source of the DLMF is publicly available [403], the actual definitions of the macros, i.e., the LaTeX style files, are still private<sup>18</sup>. B. Miller provided access to the definitions of the macros for this thesis. Later in this thesis, we will rely on additional meta-information given for each semantic macro. This includes default parameters and variables, a short textual description, and links to the DLMF CD [258]. Further information is not explicitly given in the macro definition files. For example, function constraints, domains, branch cut positions, singularities, and other properties are only given in the DLMF.

As previously mentioned, we<sup>19</sup> developed LaCASt for translating semantic LaTeX DLMF formulae to CAS [3, 13]. The first version did not contain any disambiguation steps or pattern matching approaches to deduce the intended meaning of an expression. Instead, it fully relied on the semantic LaTeX macros to perform translations to Maple. For example, sums and products were not supported directly but required the semantically enhanced macros from the DRMF [77, 78]. The source of LaCASt is not yet publicly available<sup>20</sup> due to its dependencies on the POM tagger [402] and the semantic LaTeX macros [260, 403], but it is accessible via open API endpoints<sup>21</sup>.

<sup>18</sup>As of 2021-10-01.

<sup>19</sup>The first version of LaCASt was the subject of my Master's thesis and laid the foundation for a reliable translation from semantic LaTeX to multiple CAS.

<sup>20</sup>As of 2021-10-01.

<sup>21</sup>The API contains a Swagger UI and is reachable at https://vmext-demo.formulasearchengine.com [accessed 2021-10-01]. LaCASt is available under the math/translation path (in the math controller). The experimental

### **Definition 2.1: The elements of a semantic macro**

A semantic LATEX macro is a LATEX macro with a unique name followed by a number of arguments. Certain elements of the following arguments are optional, but the order remains the same. While a caret and primes are interchangeable, each order has a different meaning, as can be seen in the examples below.

### **A semantic macro and its arguments:**


### **Examples:**

- `\sin@{x}` $\longrightarrow \sin(x)$
- `\sin@@{x}` $\longrightarrow \sin x$
- `\BesselJ{\nu}''^2@{z}` $\longrightarrow J_{\nu}''^{\,2}(z)$
- `\BesselJ{\nu}^2''@{z}` $\longrightarrow \bigl(J_{\nu}^{2}\bigr)''(z)$
- `\genhyperF{2}{1}@{a,b}{c}{z}` $\longrightarrow {}_{2}F_{1}(a,b;c;z)$
- `\genhyperF{2}{1}@@{a,b}{c}{z}` $\longrightarrow {}_{2}F_{1}\bigl({a,b \atop c};z\bigr)$
- `\genhyperF{2}{1}@@@{a,b}{c}{z}` $\longrightarrow {}_{2}F_{1}(z)$

Apart from LACAST, LATExml [257] is another tool that supports semantic LATEX and provides conversions to LATEX, MathML, and a variety of image formats. LATExml was also developed by B. Miller with the original goal of supporting the development of the DLMF [133]. LATExml is a general LATEX to XML converter. However, in order to support the development of the DLMF, LATExml is able to fully load semantic LATEX definition files to convert semantic LATEX into semantically appropriate content MathML. With this ability, LATExml is generally capable of converting other LATEX encodings too, such as the following STEX.

### **2.2.2.3 sTeX**

STEX refers to semantic TEX and should not be confused with B. Miller's semantic LATEX. STEX was developed around 2008 [194, 195, 200] with the goal of semantically annotating LATEX documents with semantic macros. Specifically, STEX should serve as a source format to generate the semantic document format OMDoc. While the underlying motivation and technical solution of STEX and semantic LATEX are very similar, there are some core differences between the two formats. Semantic LATEX was developed specifically for the DLMF and, therefore, provides semantic macros for OPSF. In particular, a semantic macro in the DLMF represents a specific, unique function. In turn, STEX aims to cover general mathematical notations and provides a logic to semantically annotate general functions and symbols. Consider the aforementioned example *π*(*x* + *y*). If *π* is referring to the prime counting function, we can resolve the ambiguity with semantic LATEX via \nprimes@{x+y} since the semantic macro \nprimes is referring to that function.

flag performs the pattern matching approaches described later in this thesis. The label allows specifying a DLMF equation label to apply specific assumptions (e.g., that *i* is an index and not the imaginary unit).

In STEX, an author can use modules and IDs to define the function and set the notation via \symdef{\pi}[1]{\prefix{\pi}{#1}}. While this makes the interpretation of *π*(*x* + *y*) unambiguous, an underlying definition is still missing. Hence, STEX provides the option to link symbols with their definitions in the document. This definition linking underlines the original motivation and connection to the semantic document format OMDoc.
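Under these conventions, a minimal STEX fragment might look as follows (the module name and symbol name are hypothetical; the syntax is a sketch following the older STEX releases [194, 195]):

```latex
% Hypothetical sTeX module declaring the prime counting function.
\begin{module}[id=prime-counting]
  % \symdef introduces the symbol and fixes its prefix notation.
  \symdef{\primepi}[1]{\prefix{\pi}{#1}}
\end{module}
% In the document body, $\primepi{x+y}$ now unambiguously refers to
% the declared symbol rather than to the constant \pi.
```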

Since STEX is not limited to specific domains, we could define any notation we want in our semantic document. On the other hand, this generalizability makes the format more verbose and somewhat similar to a programming language. In STEX, we need to define and declare symbols explicitly. In addition, a newly defined symbol still needs to be manually linked to an underlying definition. In semantic LATEX, the macro itself is linked to the appropriate definition in the DLMF. STEX provides access to predefined sets of macros that aim to cover K-14 mathematics [195].

In conclusion, STEX is flexible but verbose. The format is useful when it comes to annotating a general mathematical document semantically. However, the strength of STEX, for example, the ability to define any symbol with specific semantics, is generally not very important for translations to CAS. CAS have a fixed set of supported functions and often try to mimic common notation styles, e.g., one does not need to define − as a unary prefix operator in −2. In turn, a translation from LATEX to CAS faces the issue of identifying the names of the functions involved, their arguments, and the appropriate mappings to counterparts in the CAS syntax. Semantic LATEX, on the other hand, provides a syntax that makes it easy to solve these issues. The name of the function is directly encoded in the name of the macro, the arguments are explicitly declared and distinguishable (by curly brackets), and a mapping to an appropriate counterpart in the CAS can be found more easily due to the large overlap between the functions in the DLMF and the functions supported by CAS.

As previously mentioned, LATExml [257] is able to load TEX definition files and supports conversions to XML encodings. Hence, LATExml can transform STEX expressions to content MathML [200]. The ability to link STEX symbols with their definitions in a document or external source further makes it a source for generating entire semantically enhanced OMDoc documents [195]. STEX could also be used as an alternative to semantic LATEX for translations to CAS. However, due to the natural overlap between the functions in the DLMF and those in CAS, at some point in the development of a translation process based on STEX, we would create semantically enhanced macros for OPSF similar to the existing semantic LATEX macros. Hence, using STEX instead of semantic LATEX offers no direct advantage for performing translations towards CAS. The higher flexibility of STEX makes it a good candidate for translations beyond OPSF.

### **2.2.2.4 Template Editors**

Since LATEX is an interpretable language with over ten thousand mathematical symbols alone [280], learning LATEX syntax is often simply too time-consuming and complex for many users. To provide easier access to rendered mathematics, especially in so-called *what you see is what you get* (WYSIWYG) editors, such as Microsoft's Office programs<sup>22</sup> or Wikipedia's visual article editor<sup>23</sup>, template editors have become the norm. Template editors provide visual templates


<sup>22</sup>https://support.microsoft.com/en-us/office/equation-editor-6eac7d71-3c74-437b-80d3-c7dea24fdf3f [accessed 2021-10-01]

<sup>23</sup>Wikipedia's article about formula editors (https://en.wikipedia.org/wiki/Formula_editor [accessed 2021-10-01])

Figure 2.2: The math template editor of Microsoft's Word [395].

of standard mathematical notations so that the user only needs to fill in the remaining spaces. Figure 2.2 shows the template editor of Microsoft's Word [395] for a snippet of the templates for sums. Modern graphical interfaces of CAS also often contain such template editors to further improve the user experience. In comparison to LATEX, template editors are generally easier to use but limited to the offered templates. Hence, for more complex expressions, template editors are often described as confining [273]. Template editors do not introduce a new math format. The editors only provide a different input method but encode the mathematical formulae in system-specific formats, such as MathML in Microsoft's Word or Maple syntax in Maple.

### **2.2.3 Computable Formats**

So far, we have covered the major formats that focus on the presentation of mathematical expressions and on formats that capture the semantics. Even though formats like content MathML, OpenMath, and the semantic LATEX extensions can resolve the ambiguity of math formulae, they are not computable formats, i.e., we cannot perform actual calculations and computations on them. The syntax of a computable format is a formal language in which every word is linked to specific subroutines. Much like programming languages, computable formats are semantically unambiguous and interpretable. In turn, computable formats are generally part of a larger software package that ships an interpreter to parse inputs and an engine that performs the computations. In the following, we briefly discuss CAS and theorem prover formats as examples of computable formats. We will not specifically focus on math packages for specific programming languages, such as C++ [168], Python [252], or Java [79]. Most CAS and theorem provers, however, internally rely on those *lower-level* packages to some degree.

### **2.2.3.1 Computer Algebra Systems**

A CAS is a mathematical software system that can perform a variety of mathematical operations on math inputs, such as symbolic manipulations, numeric calculations, plotting and visualization, simplification, and many more [76, 81, 128, 413]. With the increasing power of computers, CAS became a crucial part of the modern scientific world [32, 262, 352, 356] and are widely used for mathematical problem solving [49, 51, 127, 216, 414], simulations [46, 142, 166, 265, 294], symbolic manipulations [115, 325], and even for teaching students from schools to universities [158, 237, 244, 350, 363, 365, 389, 390]. Due to their complexity, CAS are often large and expensive proprietary software packages [36, 164, 393]. However, there are several well-known open source options available [42], such as SymPy [252], Axiom<sup>24</sup> [176], and Reduce<sup>25</sup> [151]. Many CAS focus on specific domains or mathematical tasks, such as Cadabra [289, 290, 291] (tensor field theory), FORM [372] (particle physics), GAP [177] (group theory and combinatorics), PARI/GP [283] (number theory), or MATLAB [164] (primarily numeric computation). In contrast, general-purpose CAS, including Mathematica [393], Maple [36], Axiom [176], SymPy [178, 252], Maxima [264, 324], and Reduce [151], aim to provide a large set of tools and algorithms that are beneficial for many mathematical applications. Therefore, general-purpose CAS support a large number of OPSF, since these functions and polynomials are used in a large variety of different scientific fields, from pure and applied mathematics to physics and engineering. Accordingly, we primarily focus on translations to general-purpose CAS in this thesis rather than to domain-specific CAS.

The input formats of general-purpose CAS are often multi-paradigm programming languages [88], i.e., they combine multiple standard programming features, such as functional, mathematical, and procedural approaches. Major CAS generally use their own input language, such as the *Wolfram Language* in Mathematica [392]. Like any programming language, the input format must be unambiguous to the underlying parser of the CAS so that every keyword is uniquely linked to subroutines in the CAS engine. This link to a subroutine makes the expression computable. In contrast, the semantic LATEX macros are linked to theoretical mathematical concepts defined in the DLMF but not to specific implementations. Hence, a translation to a CAS syntax requires linking mathematical notations, e.g., Γ(*z*), that refer to specific mathematical concepts, e.g., the Gamma function, to the correct sequence of keywords in the CAS, e.g., GAMMA(z) in Maple.
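The core of such a rule-based mapping can be sketched in a few lines (a simplified illustration, not LACAST's actual implementation; the table entries mirror real Maple and Mathematica function names, but the template format is our own):

```python
# Minimal sketch of macro-to-CAS mapping rules. Each semantic macro is
# linked to a CAS-specific template whose placeholders take the
# arguments of the macro.
MACRO_TABLE = {
    "EulerGamma": {"Maple": "GAMMA({0})",       "Mathematica": "Gamma[{0}]"},
    "BesselJ":    {"Maple": "BesselJ({0},{1})", "Mathematica": "BesselJ[{0},{1}]"},
}

def translate(macro, args, cas):
    """Fill the CAS-specific template for a semantic macro."""
    return MACRO_TABLE[macro][cas].format(*args)

print(translate("EulerGamma", ["z"], "Maple"))           # GAMMA(z)
print(translate("BesselJ", ["nu", "z"], "Mathematica"))  # BesselJ[nu,z]
```

A real translator must additionally parse the arguments recursively, since they may contain further macros.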

Since computable languages naturally encode the highest level of semantic information in their expressions, a translation towards other systems that encode less semantic information is possible with a comprehensive list of simple mapping rules. Many CAS therefore provide a variety of different output formats, from LATEX to MathML (including content MathML) and images. Translations between CAS or other mathematical software, such as theorem provers, require more sophisticated mappings due to system-specific implementations [110]. From 2006 to 2011, a joint research project funded by the European Union with over 3 million euros was launched with the aim of improving the symbolic computation infrastructure for Europe<sup>26</sup>. The result of the *SCIEnce project* was the *Symbolic Computation Software Composability Protocol* (SCSCP) [119, 361], which uses the OpenMath encoding to transfer mathematical expressions. Using the SCSCP, interfaces for GAP [206], KANT [120], Maple [243], MuPAD [155], Mathematica [44], and Macaulay2 [311] were implemented.

Note that there are solutions available that do not require any translation between LATEX and CAS. For example, the CAS syntax of Cadabra [291] is a subset of TEX itself. Similarly, SageTeX<sup>27</sup> is a LATEX package that allows authors to enter SageMath [317] expressions into LATEX documents, turning the document into an interactive document [201] to some degree. SageMath is a general-purpose CAS that relies on existing solutions for domain-specific tasks, such as GAP [177] for group theory or PARI/GP [283] for number theory problems. These solutions do not require

<sup>24</sup>Open source since 2001 (first released in 1965).

<sup>25</sup>Open source since 2008 (first released in 1963).

<sup>26</sup>EU FP6 project 026133: https://cordis.europa.eu/project/id/26133/ [accessed 2021-10-01]

<sup>27</sup>https://doc.sagemath.org/html/en/tutorial/sagetex.html [accessed 2021-10-01]

translations, since the input must already be provided in the syntax of the CAS. Hence, any translation must be performed manually or via external tools.

In the introduction, we mentioned potential issues of CAS with multi-valued functions. Multi-valued functions map values from a domain to multiple values in a codomain and frequently appear in the complex analysis of elementary and special functions [8]. Prominent examples are the inverse trigonometric functions, the complex logarithm, and the square root. All modern CAS<sup>28</sup> compute multi-valued functions on their principal branches, which makes these functions effectively single-valued (e.g., a calculator always returns $2$ for $\sqrt{4}$ rather than $\pm 2$ or just $-2$). The correct properties of multi-valued functions on the complex plane may no longer hold for their counterpart functions in CAS, e.g., $(1/z)^w = 1/(z^w)$ for $z, w \in \mathbb{C}$ and $z \neq 0$ is no longer valid within CAS. The positioning and handling of branch cuts in CAS is often discussed in scientific articles and generally prominently noted in CAS handbooks [83, 84, 91, 108, 171, 172]. However, especially in more complex scenarios, it is easy to lose track of branch cut positioning and evaluate expressions on incorrect values. We provide a more complex example and a more detailed explanation of branch cuts in Appendix A, available in the electronic supplementary material. To the best of our knowledge, no available translation tool from, to, or between CAS (including the SCSCP solutions) considers branch cut positions.
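Python's cmath module adopts the same principal-branch convention as CAS, so it can serve as a stand-in to demonstrate the effect (a sketch; the identities shown fail in any principal-branch system):

```python
import cmath

# On the principal branch, sqrt returns the principal root, so the
# identity sqrt(z^2) = z fails for Re(z) < 0:
z = -2 + 0j
print(cmath.sqrt(z * z))  # ~ 2, not z = -2

# Likewise, the principal logarithm reduces the imaginary part into
# (-pi, pi], so log(exp(w)) does not recover w = 3*pi*i:
w = 3j * cmath.pi
print(cmath.log(cmath.exp(w)))  # imaginary part ~ +/- pi, not 3*pi
```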

### **2.2.3.2 Theorem Prover**

The idea of automated reasoning and deduction systems is as old as computers [147]. With the power of computers and a strict axiomatic approach as in the *Principia Mathematica* [385], computers can perform automatic reasoning steps to discover and prove new mathematical theorems. To this day, automated theorem proving and verifying is an extensive research area with an ever-growing interest [266, 354, 384]. There are numerous theorem provers and proof assistants available, such as HOL Light [146], HOL4 [340], or Isabelle [287]. However, focusing on deduction, the encoding of theorem provers generally goes beyond mathematical expressions. The syntax provides specific options for assumptions, links between multiple concepts, and logical steps. An example of a proof in Isabelle, which clearly visualizes the different notation of theorem provers and CAS, is given in Appendix C, available in the electronic supplementary material.

Nonetheless, theorem prover formats are computable formats with specific mathematical applications. Hence, there is a genuine interest in transferring findings and solutions from one system to the other. Some translation approaches between theorem provers and CAS are available, from direct translations [28, 148] to translations over OpenMath [57, 338] and OMDoc [152]. Theorem provers are generally unable to *compute* a single mathematical formula in the sense of numeric computations or symbolic manipulations. Hence, we do not choose theorem provers as the target computable format for our desired translation process.

### **2.2.4 Images and Tree Representations**

In the following, we briefly discuss formats with a specific visualization focus: images and tree representations. Especially older literature is often only available as digital scans, and many copies of publications do not provide access to the original LATEX source. Images can be considered the purest presentational format of mathematical expressions. Tree representations of math expressions, on the other hand, are more theoretical concepts to visualize the logical or presentational structure of math. Tree representations are primarily used for explanatory purposes to underline or visualize an idea or concept. Parse trees, in contrast, as tree formats generated from mathematical string inputs, play a crucial role in almost every mathematical software tool. Often, digital mathematical formats try to mimic the logical tree structure of math expressions. This is also one of the reasons why the web formats (MathML and OpenMath) use XML to encode mathematical content.

<sup>28</sup>The authors are not aware of any example of a CAS which treats multi-valued functions without adopting principal branches.

**Symbol Layout, Operator, Parse, and Expression Trees** Mathematical expressions are often represented in tree structures. For example, MathML itself is an XML tree data structure. Moreover, mathematicians often have a logical but theoretical tree representation of a formula in mind in which numbers and identifiers are terminal symbols (leaves) and children of math operators, functions, and relations [192, 331]. These so-called *expression trees* are more or less theoretical structures and are mainly used to visualize logical correlations and connections in mathematical expressions. Schubotz et al. [331] attempted to automate the visualization process of expression trees based on cross-referenced MathML data, which resulted in VMEXT, a visualization tool for MathML. Figure 2.3 shows a possible expression tree visualization for the Jacobi polynomial definition in terms of the hypergeometric function.

Figure 2.3: An expression tree representation of the explicit Jacobi polynomial definition in terms of the hypergeometric function.

For visualization and education purposes, these tree representations can be beneficial. However, generating these trees requires a deep understanding of the logical structure of the expression. In addition, there is no exact definition available for expression trees. Hence, the exact visualization is often up for discussion, e.g., whether parameters are children similar to variables or part of the function node itself [9]. A missing standard definition makes expression trees unreliable and, therefore, less practical as a mathematical encoding.

**Parse Trees** Parse trees are generated tree representations of source expressions (strings). These trees are generated by a parser that follows a strict set of rules, e.g., a context-free grammar [101, 188, 298]. Mathematical LATEX (as a subset of TEX), given a couple of simplifications


(e.g., no re-defined standard literals and macros), can also be described by a context-free grammar [402], even though TEX itself is Turing complete [133, 135, 187]. The POM tagger [402], for example, parses mathematical LATEX following a context-free grammar. Similarly, Chien and Cheng [71] built a custom context-free grammar parser for their semantic tokenization of mathematical LATEX expressions. LATExml follows the more sophisticated TEX-like digestion methods [187] to parse entire TEX files [133, 135]. CAS inputs are parsed internally for further processing [138, 392]. Maple's internal parser also generates a parse tree in which equivalent nodes are merged for more efficient memory usage (mathematically speaking, this data structure is no longer a valid tree but rather a directed acyclic graph, or simply DAG) [3, 13].

In contrast to theoretical tree representations, such as the mentioned expression trees, parse trees are crucial for many applications because a tree data format is easier to process due to its structural logic [93, 242, 286, 406]. While string sequences of commands may contain ambiguities, tree data structures are unique and provide easy access to single logical nodes, groups of nodes, and their dependencies. Hence, parsing a mathematical input (such as a CAS input or a LATEX expression) is typically the first step in any processing pipeline. Later in this thesis, we will also take advantage of tree representations by defining a translation between math formats as graph transformations on their tree representations. To generate a tree representation of mathematical LATEX formats, we can either build a custom parser [71] or rely on existing parsers, such as LATExml [257] or the POM tagger [402]. Parse trees (and other custom tree formats generated by analyzing a given input) can also be categorized into symbol layout trees (for presentational formats) and operator trees (for content/semantic formats) [406]. For example, parsing LATEX may result in a symbol layout tree that describes the visual structure of a formula, while parsing semantic LATEX (or CAS inputs) may result in an operator tree that describes the logical mathematical structure of the input.
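Python's ast module illustrates the principle: a fixed grammar turns a string into a parse tree with easily accessible nodes (we use a Python spelling of Equation (2.1) purely as an example input):

```python
import ast

# Parse a Python spelling of W(2,k) >= 2**k / k**eps into a tree.
tree = ast.parse("W(2, k) >= 2**k / k**eps", mode="eval")

# The root node is the relation (>=); its children are the function
# call W(2, k) and the quotient 2**k / k**eps.
print(ast.dump(tree.body, indent=2))
```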

**Images** From pixel graphics (e.g., JPEG or PNG) to vector graphics (e.g., Scalable Vector Graphics (SVG)) and document formats (e.g., PDF), mathematical expressions can appear in a variety of different image formats. The two-dimensional structure of mathematics makes drawing mathematical formulae on a sheet of paper or a touch screen the most intuitive input method for mathematics. In addition, with rising digitization, scans of old scientific articles are no longer the only source of math images. Handwriting systems are increasingly adopted in offices and educational institutions [411]. In 2016, Wikipedia switched from non-scalable PNG images to vector graphics for visualizing mathematics [17] (see Appendix B, available in the electronic supplementary material, for a more detailed overview of the history of math formulae in Wikipedia).

However, image formats are not directly interpretable and are, therefore, less machine-readable. Hence, the first step in analyzing mathematics in images is always a conversion into a more machine-readable, digital format. The majority of conversion approaches, including handwriting recognition and Optical Character Recognition (OCR), focus on translations to MathML or LATEX [373, 406, 411]. Hence, for our task (translating presentational formats to computable formats), starting with image formats is not practically useful.

Nonetheless, one particular issue in math OCR is also of interest for our translation task: the detection of inline mathematics. In image formats, detecting inline mathematics is difficult because formulae may blend into the text [74, 125, 126, 230, 398]. Even the detection of italic fonts can be a challenging task [66, 112, 113, 233]. A variable can easily be confused with a word, such as the Latin letter '*a*.' A similar issue arises in other formats, including LATEX documents and Wikipedia articles, when an author does not correctly annotate mathematical formulae. In Wikipedia, for example, single identifiers in a text are often put in italic font rather than in mathematical environments. The ability to use UTF-8 encodings incites Wikipedia editors to put inline mathematics directly into the text, even when special characters are involved. For example, the mathematical expression 0 ≤ *φ* ≤ 4*π* in the English Wikipedia article about Jacobi polynomials<sup>29</sup> is a sequence of UTF-8 characters and thus challenging to identify as mathematics for MathIR parsers. Nevertheless, identifying all mathematical expressions in a document might be necessary for more reliable translations towards computable formats. For example, the mentioned relation for *φ* defines the domain of the Wigner d-matrix and is of interest for automatic evaluations (see Chapter 5).
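A crude heuristic for flagging such UTF-8 inline math can be sketched with a character-class search (a toy example, not a production MathIR detector; the character set is our own choice):

```python
import re

# Flag snippets containing mathematical operators or Greek letters
# (U+0370 to U+03FF covers the Greek and Coptic block).
MATH_CHARS = re.compile(r"[≤≥≠±∈∉∀∃∑∏∫√∞≈≡\u0370-\u03ff]")

def looks_like_math(snippet):
    return bool(MATH_CHARS.search(snippet))

print(looks_like_math("0 ≤ φ ≤ 4π"))        # True
print(looks_like_math("a plain sentence"))  # False
```

Such a heuristic produces false negatives for purely Latin-letter expressions such as a lone italic *a*, which is precisely why the detection problem is hard.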

### **2.2.5 Math Embeddings**

Word embedding techniques have received significant attention in recent years in the Natural Language Processing (NLP) community, especially after the publication of word2vec [256]. Therefore, more and more projects try to adapt this knowledge to solve tasks in the MathIR arena [15, 121, 141, 215, 353, 360, 400, 404]. These projects try to embed math expressions alongside natural language to create vector representations of formulae. Among all representations of mathematical formulae, a vector representation is the format with the highest machine readability. Math embeddings successfully enabled a new approach to measure the similarity between math expressions, which is especially useful for math search, classification, and similar tasks [121, 215, 400, 404].

Considering the equation embedding techniques in [215], we distinguish three main types of mathematical embedding: *Mathematical Expressions as Single Tokens*, *Stream of Tokens*, and *Semantic Groups of Tokens*. In the following, we briefly explain each type using an example expression containing the inequality for Van der Waerden numbers

$$W(2,k) \geqslant 2^k/k^\varepsilon. \tag{2.1}$$

This expression is the first entry in the MathML benchmark [18], which we will explain in detail in Section 2.3.

**Mathematical Expressions as Single Tokens** So-called equation embeddings (EqEmb) were introduced by Krstovski and Blei [215] and use an entire mathematical expression as one token. In a one-token representation, the inner structure of the mathematical expression is not considered. For example, *W*(*r, k*) is represented as one single token *t*1. Any other expression, such as *W*(2*, k*) in the same context, is an entirely independent token *t*2. Therefore, this approach does not learn any connections between *W*(2*, k*) and *W*(*r, k*). Nonetheless, [215] has shown promising results for comparing mathematical expressions with this approach.

**Stream of Tokens** As an alternative to embedding a mathematical expression as a single token, one can also represent an expression through the sequence of its inner elements. For example, considering only the identifiers in Equation (2.1), this would generate *W*, *k*, and *ε* as a sequence/stream of tokens. This approach has the advantage of learning all mathematical tokens.

**37**

<sup>29</sup>https://en.wikipedia.org/wiki/Jacobi\_polynomials#Applications [accessed 2021-10-01]

However, this method also has some drawbacks. Complex mathematical expressions may lead to long chains of elements, which can be especially problematic when the window size of the training model is too small. Naturally, there are approaches to reduce the length of the chains. Gao et al. [121] use a continuous bag of words (CBOW) approach and embed all mathematical symbols, including identifiers and operands, such as +, −, or variations of equalities, such as =. Krstovski and Blei [215] also evaluated the stream of tokens approach but did not cut out symbols. They trained their model on the entire sequence of tokens that the LATEX tokenizer generates. For Equation (2.1), this results in a stream of 13 tokens. They use a long short-term memory (LSTM) architecture to overcome the limiting window size and further limit chain lengths to 20–150 tokens. Usually, in word embedding, such behaviour is not preferred since it increases the noise in the data.

We [15] also use this stream of tokens approach to train our model on the DLMF without any filters. Thus, Equation (2.1) generates all 13 tokens. Later, in Section 3.1, we show another model trained on the arXiv collection, which uses a stream of mathematical identifiers and cuts out all other tokens, i.e., in the case of (2.1), we embed *W*, *k*, and *ε*. We presume this approach is more appropriate for learning connections between identifiers and their definiens. We will see later that both of our models trained on math embeddings are able to detect similarities between mathematical objects but do not perform well on detecting connections to word descriptors.

**Semantic Groups of Tokens** The third approach to embedding mathematics is only theoretical. Current MathIR and Machine Learning (ML) approaches would benefit from a basic structural knowledge of mathematical expressions, such that variations of function calls (e.g., *W*(*r, k*) and *W*(2*, k*)) can be recognized as the same function. Instead of defining a unified standard, current techniques use their own ad-hoc interpretations of structural connections. We assume that an embedding technique would benefit from a system that can detect the parts of interest in mathematical expressions before any training process. However, such a system does not yet exist. Later, in Section 3.2, we will introduce a new concept for interpreting logical groups of mathematical objects that may enable a semantic embedding in the future.

It is important to mention that it remains unclear to what degree the semantic information of math can be embedded in a vector representation [9]. Since there is no answer to this question, we have not included math embeddings (i.e., vector representations of formulae) in Figure 2.1. Nonetheless, a vector representation can be decoded into a CAS syntax representation again to perform an ML-based translation [296]. We will elaborate on such an approach in Chapter 4.

### **2.3 From Presentation to Content Languages**

We introduced several different formats for encoding mathematical formulae digitally and provided an overview of several existing conversion tools between these formats. Considering Figure 2.1, the goal of this thesis, i.e., making presentational math computable, requires converting mathematical formats from the far left of the figure to the far right. We have chosen LATEX as the source format and general-purpose CAS syntaxes as the target formats. Considering the merit of communicating knowledge in the sciences, it comes as no surprise that there are numerous translation tools and theoretical approaches available to convert math formulae between multiple formats, including our goal translation from LATEX to CAS syntaxes. Since MathML is the web standard and is at least partially supported by several CAS [57, 110, 303, 338] (or OpenMath, respectively), a translation from LATEX to CAS could be performed over MathML (preferably content MathML). In this section, we analyze state-of-the-art LATEX to MathML converters to study the applicability of using MathML as an intermediate format for translations from LATEX to CAS syntaxes. This section was previously published in [18].

### **2.3.1 Background**

In the following, we use the Riemann hypothesis (2.2) as an example to explain typical challenges of converting diferent representation formats of mathematical formulae:

$$
\zeta(s) = 0 \Rightarrow \Re s = \frac{1}{2} \lor \Im s = 0. \tag{2.2}
$$

We will focus on the representation of the formula in LATEX and in the format of the CAS Mathematica. LATEX is a common language for encoding the presentation of mathematical formulae. In contrast to LATEX, Mathematica's representation focuses on making formulae computable. Hence, the content must be encoded, i.e., both the structure and the semantics of a mathematical formula must be taken into consideration.

In LATEX, the Riemann hypothesis can be expressed using the following string:

```latex
% Riemann hypothesis in LaTeX
\zeta (s) = 0 \Rightarrow \Re s = \frac 12 \lor \Im s=0
```
In Mathematica, the Riemann hypothesis can be represented as:

The conversion between these two formats is challenging due to a range of conceptual and technical diferences.

First, the grammars underlying the two representation formats differ greatly. LATEX uses the unrestricted grammar of the TEX typesetting system. The entire set of commands can be redefined and extended at runtime, which means that TEX effectively allows its users to change every character used for the markup, including the \ character typically used to start commands. The large degree of freedom of the TEX grammar significantly complicates recognizing even the most basic tokens contained in mathematical formulae. In contrast to LATEX, CAS use a significantly more restrictive grammar consisting of a predefined set of keywords and fixed rules that govern the structure of expressions. For example, in Mathematica, function arguments must always be enclosed in square brackets and separated by commas.

Second, the extensive differences in the grammars of the two languages are reflected in the resulting expression trees. Similar to parse trees in natural language, the syntactic rules of mathematical notation, such as operator precedence and function scope, determine a hierarchical structure for mathematical expressions that can be understood, represented, and processed as a tree. The mathematical expression trees of formulae consist of functions or operators and their arguments. We use nested square brackets to denote levels of the tree and Arabic numbers in a gray font to indicate individual tokens in the markup. For the LATEX representation of the Riemann hypothesis, the expression tree is:

The tree consists of 18 nodes, i.e., tokens, with a maximum depth of two (for the fraction command \frac12). The expression tree of the Mathematica expression consists of 16 tokens with a maximum depth of five:

The higher complexity of the Mathematica expression reflects that a CAS represents the content structure of the formula, which is deeply nested. In contrast, LATEX exclusively represents the presentational layout of the Riemann hypothesis, which is almost linear.

For the given example of the Riemann hypothesis, finding alignments between the tokens in both representations and converting one representation into the other is possible. In fact, Mathematica and other CAS offer a direct import of TEX expressions, which we evaluate in Section 2.3.3.

However, aside from technical obstacles, such as reliably determining tokens in TEX expressions, conceptual differences also prevent a successful conversion between presentation languages, such as TEX, and content languages. Even if there were only one generally accepted presentation language, e.g., a standardized TEX dialect, and only one generally accepted content language, e.g., a standardized input language for CAS, an accurate conversion between the representation formats could not be guaranteed.

The reason is that neither the presentation language nor the content language always provides all information required to convert an expression to the respective other language. This can be illustrated by the simple expression *F*(*a* + *b*) = *F a* + *F b*. The inherent content ambiguity of *F* prevents a deterministic conversion from the presentation language to a content language. *F* might, for example, represent a number, a matrix, a linear function, or even a symbol. Without additional information, a correct conversion to a content language is not guaranteed. On the other hand, the transformation from a content language to a presentation language often depends on the preferences of the author and the context. For example, authors sometimes change the presentation of a formula to focus on specific parts of the formula or to improve its readability.
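The ambiguity of *F* can be made concrete with a minimal sketch (our illustration; the concrete values are arbitrary):

```python
# Reading 1: F denotes a number, so the juxtaposition F(a + b) is a
# multiplication and F(a + b) = Fa + Fb holds by distributivity.
F_number = 3.0
a, b = 2.0, 5.0
print(F_number * (a + b) == F_number * a + F_number * b)  # True

# Reading 2: F denotes an arbitrary (non-linear) function, so the same
# presentation is a function application and the identity fails.
F_function = lambda x: x * x
print(F_function(a + b) == F_function(a) + F_function(b))  # False
```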

Another obstacle to conversions between typical presentation languages and typical content languages, such as the formats of CAS, is the restricted set of functions and the simpler grammars that CAS offer. While TEX allows users to express the presentation of virtually all mathematical symbols, thus denoting any mathematical concept, CAS do not support all available mathematical functions or structures. A significant problem related to this discrepancy between the space of concepts expressible using presentation markup and the implementation of such concepts in CAS are branch cuts. Branch cuts are restrictions of the set of output values that CAS impose for functions that yield ambiguous, i.e., multiple mathematically permissible, outputs. One example is the complex logarithm [98, (4.2.1)], which has an infinite set of permissible outputs resulting from the periodicity of its inverse function. To account for this circumstance, CAS typically restrict the set of permissible outputs by cutting the complex plane of permissible outputs. However, since the method of restricting the set of permissible outputs varies between systems, identical inputs can lead to drastically different results [3]. Multiple scientific publications address the problem of accounting for branch cuts when entering expressions in CAS, such as [109] for Maple.
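The effect of a branch cut can be demonstrated with Python's cmath module, which implements the principal branch of the complex logarithm (a minimal sketch of the phenomenon, not of any particular CAS):

```python
import cmath

# Two unit-modulus numbers whose arguments sum to more than pi.
z1 = cmath.exp(2j)  # argument 2
z2 = cmath.exp(2j)  # argument 2

# The principal branch wraps the argument back into (-pi, pi], so the
# naive identity log(z1*z2) = log(z1) + log(z2) fails across the cut.
lhs = cmath.log(z1 * z2)             # argument 4 - 2*pi
rhs = cmath.log(z1) + cmath.log(z2)  # argument 4

print(cmath.isclose(rhs - lhs, 2j * cmath.pi))  # True
```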

Our review of obstacles to the conversion of representation formats for mathematical formulae highlights the need to store *both* presentation and content information to allow for reversible transformations. Mathematical representation formats that include presentation and content information can enable the reliable exchange of information between typesetting systems and CAS.

MathML offers standardized markup functionality for both presentation and content information. Moreover, the declarative MathML XML format is relatively easy to parse and allows for cross references between Presentation Language (PL) and Content Language (CL) elements. Listing 2.3 shows excerpts of the MathML markup for our example of the Riemann hypothesis (2.2). In this excerpt, the PL token 7 corresponds to the CL token 19, PL token 5 corresponds to CL token 20, and so forth.

```
 Riemann hypothesis in MathML
<math><semantics><mrow>...
  <mo id="5" xref="20">=</mo>
  <mn id="6" xref="21">0</mn>
  <mo id="7" xref="19">⇒</mo>...</mrow>
  <annotation-xml encoding="MathML-Content">
    <apply><implies id="19" xref="7"/>
    <apply><eq id="20" xref="5"/>...
    <apply><csymbol id="21" xref="1" cd="wikidata">Q187235</csymbol>...
  </annotation-xml></semantics></math>
```
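The id/xref cross references make the PL-to-CL alignment machine readable. A minimal sketch of extracting the alignment with Python's standard library (toy input modeled on Listing 2.3; real MathML additionally carries namespaces):

```python
import xml.etree.ElementTree as ET

# Toy parallel-markup fragment modeled on Listing 2.3.
mathml = """
<math><semantics><mrow>
  <mo id="5" xref="20">=</mo>
  <mo id="7" xref="19">&#x21D2;</mo>
</mrow>
<annotation-xml encoding="MathML-Content">
  <apply><implies id="19" xref="7"/>
  <apply><eq id="20" xref="5"/></apply></apply>
</annotation-xml></semantics></math>
"""

root = ET.fromstring(mathml)
# Collect every element that carries both an id and an xref attribute.
pairs = {el.get("id"): el.get("xref")
         for el in root.iter()
         if el.get("id") and el.get("xref")}
print(pairs)  # {'5': '20', '7': '19', '19': '7', '20': '5'}
```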
Combined presentation and content formats, such as MathML, significantly improve access to mathematical knowledge for users of digital libraries. For example, including the content information of formulae can advance search and recommendation systems for mathematical content. The quality of these *mathematical information retrieval systems* crucially depends on the accuracy of the computed document-query and document-document similarities. Considering the content information of mathematical formulae can improve these computations by:

Listing 2.3: MathML representation of the Riemann hypothesis (2.2) (excerpt).


Content information could furthermore enable interactive support functions for consumers and producers of mathematical content. For example, readers of mathematical documents could be offered interactive computations and visualizations of formulae to accelerate the understanding of STEM documents. Authors of mathematical documents could benefit from automated editing suggestions, such as autocompletion, reference suggestion, and sanity checks, e.g., type and definiteness checking, similar to the functionality of word processors for natural language texts.

### **2.3.1.1 Related Work**

A variety of tools exist to convert representation formats of mathematical formulae. However, to our knowledge, Stamerjohanns et al. [351] presented the only study that evaluated the conversion quality of such tools. Unfortunately, many of the tools evaluated by Stamerjohanns et al. are no longer available or out of date. Watt presents a strategy to preserve formula semantics in TEX to MathML conversions. His approach relies on encoding the semantics in custom TEX macros rather than expanding the macros [380]. Padovani discusses the roles of MathML and TEX elements for managing large repositories of mathematical knowledge [278]. Nghiem et al. used statistical machine translation to convert presentation to content language [271]. However, they do not consider the textual context of formulae. We will present detailed descriptions and evaluation results for specific conversion approaches in Section 2.3.3.

Youssef addressed the semantic enrichment of mathematical formulae in presentation language. He developed an automated tagger that parses LATEX formulae and annotates recognized tokens, very similar to Part-of-Speech (POS) taggers for natural language [402]. The tagger currently uses a predefined, context-independent dictionary to identify and annotate formula components. Schubotz et al. proposed an approach to semantically enrich formulae by analyzing their textual context for the definitions of identifiers [329, 330].

With their 'math-in-the-middle' approach, Dehaye et al. envision an entirely different approach to exchanging machine-readable mathematical expressions. In their vision, independent and enclosed virtual research environments use a standardized format for mathematics to avoid conversions and transfers between different systems [94].

For an extensive review of format conversion and retrieval approaches for mathematical formulae, refer to [326, Chapter 2].

### **2.3.2 Benchmarking MathML**

This section presents MathMLben, a benchmark dataset for measuring the quality of MathML markup of mathematical formulae appearing in a textual context. MathMLben is an improvement of the gold standard provided by Schubotz et al. [329]. The dataset considers recent discussions of the International Mathematical Knowledge Trust<sup>30</sup> working group, in particular the idea of a 'Semantic Capture Language' [165], which makes the gold standard more robust and easily accessible. MathMLben:


In Section 2.3.2.1, we present the test collection included in MathMLben. In Section 2.3.2.2, we present the encoding guidelines for the human assessors and describe the tools we developed to support assessors in creating the gold standard dataset. In Section 2.3.2.3, we describe the similarity measures used to assess the markup quality.

### **2.3.2.1 Collection**

Our test collection contains 305 formulae (more precisely, mathematical expressions ranging from individual symbols to complex multi-line formulae) and the documents in which they appear.

**Expressions 1 to 100** correspond to the search targets used for the 'National Institute of Informatics Testbeds and Community for Information access Research Project' (NTCIR) 11 Math Wikipedia Task [329]. This list of formulae has been used for formula search and content enrichment tasks by at least seven different research institutions. The formulae were randomly sampled from Wikipedia and include expressions with incorrect presentation markup.

**Expressions 101 to 200** are random samples taken from the NIST DLMF [98]. The DLMF website contains 9,897 labeled formulae created from semantic LATEX source files [77, 78]. In contrast to the examples from Wikipedia, all these formulae are from the mathematics research field and exhibit high-quality presentation markup. The formulae were curated by renowned mathematicians, and the editorial board keeps improving the quality of the formulae's markup<sup>31</sup>. Sometimes, a labeled formula contains multiple equations. In such cases, we randomly chose one of the equations.

**Expressions 201 to 305** were chosen from the queries of the NTCIR arXiv and NTCIR-12 Wikipedia datasets. 70% of these queries originate from the arXiv [22] and 30% from a Wikipedia dump.

<sup>30</sup>http://imkt.org/ [accessed 2021-08-03]

<sup>31</sup>http://dlmf.nist.gov/about/staff [accessed 2021-08-03]

All data is openly available for research purposes and can be obtained from: https://mathmlben.wmflabs.org<sup>32</sup>.

### **2.3.2.2 Gold Standard**

We provide explicit markup with universal, context-independent symbols in content MathML. Since the symbols from the default content dictionary of MathML<sup>33</sup> alone were insufficient to cover the range of semantics in our collection, we added the Wikidata content dictionary [328]. As a result, we could refer to all Wikidata items as symbols in a content tree. This approach has several advantages. Descriptions and labels are available in many languages. Some symbols even have external identifiers, e.g., from the Wolfram Functions Site or from Stack Exchange topics. All symbols are linked to Wikipedia articles, which offer extensive human-readable descriptions. Finally, symbols have relations to other Wikidata items, which opens a range of new research opportunities, e.g., for improving the taxonomic distance measure [336].
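Following the convention of Listing 2.3, a content expression can then apply a head symbol drawn from the Wikidata content dictionary (a sketch; the item ID mirrors the one used in Listing 2.3):

```
<apply>
  <!-- head symbol taken from the Wikidata content dictionary -->
  <csymbol cd="wikidata">Q187235</csymbol>
  <ci>s</ci>
</apply>
```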

Our Wikidata-enhanced, yet standard-compliant, MathML markup facilitates the manual creation of content markup. To further support human assessors in creating content annotations, we extended the VMEXT visualization tool [331] to develop a visual support tool for creating and editing the *MathMLben* gold standard.


Table 2.3: Special content symbols added to LATExml for the creation of the gold standard.

For each formula, we saved the source document written in different dialects of LATEX and converted it into content MathML with parallel markup using LATExml [135, 257]. LATExml is a Perl program that converts LATEX documents to XML and HTML. We chose LATExml because it is the only tool that supports our semantic macro set. We manually annotated our dataset, generated the MathML representation, manually corrected errors in the MathML, and linked the identifiers to Wikidata concept entries whenever possible. Alternatively, one could initially generate MathML using a CAS and then manually improve the markup.

Since there is no generally accepted definition of expression trees, we made several design decisions to create semantic representations of the formulae in our dataset using MathML trees. In some cases, we created new macros to be able to generate a suitable MathML tree with LATExml<sup>34</sup>. Table 2.3 lists the newly created macros. Hereafter, we explain our decisions and give examples of formulae in our dataset that were affected by these decisions.

<sup>32</sup>Visit https://mathmlben.wmflabs.org/about for a user guide [accessed 2021-08-03].

<sup>33</sup>http://www.openmath.org/cd [accessed 2021-08-03]

<sup>34</sup>http://dlmf.nist.gov/latexml/manual/customization/customization.latexml.html#SS1. SSS0.Px1 [accessed 2021-08-03]


$$E = mc^2,\tag{\star}$$

the (⋆) is the ignored label;


Some of these design decisions are debatable. For example, introducing a new macro, such as \identifiername{}, to distinguish between multi-character identifiers and operators might be advantageous to our approach. However, introducing many highly specialized macros is likely not a viable approach. A borderline example in regard to this problem is Δ*x* [GoldID 280]. Formulae of this form could be annotated as \operatorname{}, \identifiername{}, or more generally as \expressionname{}. We interpret Δ as a difference applied to a variable and render the expression as a function call.

Figure 2.4: Graphical User Interface (GUI) to support the creation of our gold standard. The interface provides several TEX input fields (left) and a mathematical expression tree rendered by the VMEXT visualization tool (right).

Similar cases of overloading the dataset with highly specialized macros are bracket notations. For example, the Dirac (bra-ket) notation, e.g., [GoldID 209], is mainly used in quantum physics. The angle brackets of the Dirac notation, ⟨ and ⟩, and the vertical bar | are already interpreted correctly as "latexml - quantum-operator-product". However, a more precise distinction between a twofold scalar product, e.g., ⟨*a*|*b*⟩, and a threefold expectation value, e.g., ⟨*a*|*A*|*a*⟩, might become necessary in some scenarios to distinguish between matrix elements and a scalar product.

We developed a Web application to create and cultivate the gold standard entries, which is available at: https://mathmlben.wmflabs.org/. The GUI provides the following information for each Gold ID entry.


Figure 2.4 shows the GUI that allows users to manually modify the different formats of a formula. While the other fields are intended to provide additional information, the pipeline to create and cultivate a gold standard entry starts with the semantic LATEX input field. LATExml generates content MathML based on this input, and VMEXT renders the generated content MathML afterwards. We control the output by using the DLMF LATEX macros [260] and our developed extensions. The following list contains some examples of the DLMF LATEX macros.


The DLMF web pages, which we use as one of the sources for our dataset, were generated from semantically enriched LATEX sources using LATExml. Since LATExml is capable of interpreting semantic macros, generates content MathML that can be controlled with macros, and is easily extensible with new macros, we also used LATExml to generate our gold standard. While the DLMF is a compendium for special functions, we need to annotate every identifier in a formula with semantic information. Therefore, we extended the set of semantic macros.

In addition to the special symbols listed in Table 2.3, we created macros to semantically enrich identifiers, operators, and other mathematical concepts by linking them to their Wikidata items. As shown in Figure 2.4, the annotations are visualized using yellow info boxes appearing on mouseover. The boxes show the Wikidata QID, the name, and the description (if available) of the linked concept.

Aside from naming, classifying, and semantically annotating each formula, we performed three other tasks:


Most of the extracted formulae contained concepts to improve the human readability of the source code, such as commented line breaks, %\n, in long mathematical expressions, or special macros to improve the displayed version of the formula, e.g., spacing macros, delimiters, and scale settings, such as \!, \,, or \>. Since they are part of the expression, all of the tested tools (including LATExml) try to include these formatting improvements in the MathML markup. For our gold standard, we focus on the pure semantic information and forgo formatting improvements related to displaying the formula. The corrected TEX field shows the cleaned mathematical LATEX expression.

Using the corrected TEX field and the semantic macros, we were able to adjust the MathML output using LATExml and verify it by checking the visualization from VMEXT.

### **2.3.2.3 Evaluation Metrics**

To quantify the conversion quality of individual tools, we computed the similarity of each tool's output and the manually created gold standard. To define the similarity measures for this comparison, we built upon our previous work [336], in which we defined and evaluated four similarity measures: taxonomic distance, data type hierarchy level, match depth, and query coverage. The measures taxonomic distance and data type hierarchy level require the availability of a hierarchical ordering of mathematical functions and objects. For our use case, we derived this hierarchical ordering from the MathML content dictionary. The measures assign a higher similarity score if matching formula elements belong to the same taxonomic class. The match depth measure operates under the assumption that matching elements that are more deeply nested in a formula's content tree, i.e., farther away from the root node, are less significant for the overall similarity of the formula and are hence assigned a lower weight. The query coverage measure performs a simple 'bag of tokens' comparison between two formulae and assigns a higher score the more tokens the two formulae share.
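The 'bag of tokens' idea behind query coverage can be sketched as follows (our simplified formulation; the measure defined in [336] may normalize or weight tokens differently):

```python
from collections import Counter

def query_coverage(query_tokens, candidate_tokens):
    """Fraction of the query's token occurrences found in the candidate."""
    query = Counter(query_tokens)
    shared = query & Counter(candidate_tokens)  # multiset intersection
    return sum(shared.values()) / sum(query.values()) if query else 0.0

# \zeta(s) = 0 compared against \zeta(s) = 1: five of six tokens match.
print(query_coverage(['zeta', '(', 's', ')', '=', '0'],
                     ['zeta', '(', 's', ')', '=', '1']))  # 0.8333...
```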

In addition to these similarity measures, we also included the tree edit distance. For this purpose, we adapted the robust tree edit distance (RTED) implementation for Java [288]. We modified RTED to accept any valid XML input and added math-specific 'shortcuts', i.e., rewrite rules that generate lower distance scores than arbitrary rewrites. For example, rewriting *a*/*b* to *a* · *b*<sup>−1</sup> causes a significant difference in the expression tree: three nodes (∧, −, 1) are inserted and one node is renamed (÷ → ·). The 'costs' for performing these edits using the stock implementation of RTED are *c* = 3*i* + *r*. However, the actual difference is an equivalence, which we think should be assigned a cost of *e* < 3*i* + *r*. We set *e* < *r* < *i*.
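The effect of the shortcut on the edit cost can be sketched numerically (the constants are our illustrative choice satisfying *e* < *r* < *i*; this is not the RTED API):

```python
# Illustrative edit costs satisfying e < r < i (values are arbitrary).
INSERT = 1.0   # i: insert a node
RENAME = 0.5   # r: rename a node
EQUIV = 0.25   # e: math-specific shortcut for an equivalent rewrite

# Stock edit script for rewriting a/b into a*b**-1:
# three inserted nodes (^, -, 1) and one renamed node (divide -> times).
stock_cost = 3 * INSERT + 1 * RENAME
print(stock_cost)  # 3.5

# With the shortcut, the whole rewrite counts as a single equivalence.
print(EQUIV)  # 0.25
```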

### **2.3.3 Evaluation of Context-Agnostic Conversion Tools**

This section presents the results of evaluating existing, context-agnostic conversion tools for mathematical formulae using our benchmark dataset MathMLben (cf. Section 2.3.2). We compare the distances between the presentation MathML and content MathML trees yielded by each tool and the respective trees of the formulae in the gold standard. We use the tree edit distance with customized weights and math-specific shortcuts. The goal of the shortcuts is to eliminate notation-inherent degrees of freedom, e.g., additional PL elements or layout blocks, such as mrow or mfenced.

### **2.3.3.1 Tool Selection**

We compiled a list of available conversion tools from the W3C<sup>35</sup> wiki, from *GitHub*, and from questions about the automated conversion of mathematical LATEX to MathML on *Stack Overflow*. We selected the following converters:

<sup>35</sup>https://www.w3.org/wiki/Math\_Tools [accessed 2021-08-03]


### **2.3.3.2 Testing framework**

We developed a Java-based framework that calls the programs to parse the corrected TEX input data from the gold standard to presentation MathML and, if applicable, to content MathML. In the case of the POM tagger, we parsed the input string to a general XML document. We used the corrected TEX input format instead of the originally extracted string expressions, see Section 2.3.2.2.

Executing the testing framework requires the manual installation of the tested tools. The POM tagger is not yet publicly available.

### **2.3.3.3 Results**

Figure 2.5 shows the averaged structural tree edit distances between the presentation trees (blue) and content trees (orange) of the generated MathML files and the gold standard. To

<sup>36</sup>https://www.mathtowebonline.com [accessed 2021-08-03]

<sup>37</sup>https://fred-wang.github.io/TeXZilla [accessed 2021-08-03]

<sup>38</sup>https://github.com/gjtorikian/mathematical [accessed 2021-08-03]

Figure 2.5: Overview of the structural tree edit distances (using *r* = 0*, i* = *d* = 1) between the MathML trees generated by the conversion tools and the gold standard MathML trees.

calculate the structural tree edit distances, we used the RTED [288] algorithm with costs of *i* = 1 for inserting, *d* = 1 for deleting, and *r* = 0 for renaming nodes. Furthermore, the figure shows the total number of successful transformations for the 305 expressions (black ticks). Note that we also consider differences between the presentation tree and the gold standard as deficits, because the mapping from LATEX expressions to rendered expressions is unique (as long as the same preambles are used). A larger number indicates that more elements of an expression were misinterpreted by the parser. However, certain differences between presentation trees might be tolerable, e.g., reordering commutative expressions, while differences between content trees are more critical. Also note that improving content trees may not necessarily improve presentation trees and vice versa. In the case of *f*(*x* + *y*), the content tree will change depending on whether *f* represents a variable or a function, while the presentation tree will be identical in both cases. In contrast, writing the quotient of *a* and *b* as a stacked fraction, as a slanted fraction, or as the inline expression *a*/*b* yields different presentation trees, while the content trees are identical.

Figure 2.6 illustrates the runtime performance of the tools. We excluded the CAS from the runtime performance tests because such a system is not primarily intended for parsing LATEX expressions but for performing complex computations. Therefore, runtime comparisons between a CAS and conversion tools would not be representative. We measured the times required to transform all 305 expressions in the gold standard and write the transformed MathML to the storage cache. Note that the native code of LaTeX2MathML, Mathematical, and LATExml was called from the Java Virtual Machine (JVM) and Mathoid was called through local web requests, which increased the runtime of these tools. The figure is scaled logarithmically. We would like to emphasize that LATExml is designed to translate sets of LATEX documents rather than single mathematical expressions. Most of the other tools are lightweight engines.

**Performance of Tools**

Figure 2.6: Time in seconds required by each tool to parse the 305 gold standard LATEX expressions in logarithmic scale.

In this benchmark, we focused on the structural tree distances rather than on distances in semantics. While our gold standard provides the information necessary to compare the extracted semantic information, we leave this comparison to future work.

### **2.3.4 Summary of MathML Converters**

We make available the first benchmark dataset to evaluate the conversion of mathematical formulae between presentation and content formats. During the encoding process for our MathML-based gold standard, we identified the conceptual and technical issues that conversion tools for this task must address. Using the newly created benchmark dataset, we evaluated popular context-agnostic LATEX-to-MathML converters. We found that many converters simply do not support the conversion from presentation to content format, and those that do often yielded mathematically incorrect content representations even for basic input data. These results underscore the need for future research on mathematical format conversions.

Of the tools we tested, LATExml yielded the best conversion results, was easy to configure, and is highly extensible. However, these benefits come at the price of a slow conversion speed. Due to its comparably low error rate, we chose to extend the LATExml output with semantic enhancements.

### **2.4 Mathematical Information Retrieval for LaTeX Translations**

In the following, we briefly discuss related work in the Mathematical Information Retrieval (MathIR) arena in order to find existing practical approaches for a translation from presentational to computable formats. MathIR is the research area that aims to retrieve additional (generally semantic) information about mathematical content [141]. In turn, the task of translating mathematical presentational formats to computable formats is part of this research area, since it requires a context-dependent semantification<sup>39</sup>, i.e., the semantic enhancement or enrichment of mathematical objects with additional information. One of the most well-studied tasks in MathIR<sup>40</sup> is searching for relevant mathematical expressions or content [21, 22, 241, 346, 405, 408]. However, successful solutions in this area focus on similarity measures and do not necessarily require a deep understanding of the meaning and content of a formula. Likewise, other tasks in MathIR, such as entity linking, use similarity measures to retrieve connections between entities rather than semantic relatedness [208, 319, 321]. Thus, much related work in MathIR is not particularly beneficial for translating presentational encodings to computable formats. One of the reasons for this research gap is presumably a semantic version of the *chicken or the egg* causality dilemma. On the one hand, semantically enriching the mathematical objects in an expression requires identifying the meaningful objects. On the other hand, identifying those meaningful objects requires semantic information about those objects. In other words, if we want to annotate *P*<sub>*n*</sub><sup>(*α,β*)</sup>(*x*) with *Jacobi polynomial* in our use case equation (1.1), we need to know that *P*<sub>*n*</sub><sup>(*α,β*)</sup>(*x*) refers to the Jacobi polynomial.

Figure 2.7 illustrates this issue by splitting a math expression into four layers of mathematical objects. The identifier layer contains all identifiers (which may also include general symbols and numbers). The arithmetic layer contains arithmetic structures that combine tokens from the identifier layer into mathematical terms. This layer may include logic terms, sets, and other mathematical concepts with specific notations. The function layer combines elements from the lower layers into entire function calls. The top expression layer contains the entire expressions in documents, which are often a composition of elements from the previous layers. The difference between elements in the function and arithmetic layers is the ambiguity of the notations. Elements in the arithmetic layer generally do not need to be mapped to specific keywords in CAS because they are often semantically unique. In contrast, elements in the function layer are potentially ambiguous. However, a clear distinction between both layers is not always necessary and may even be confusing in other MathIR-related scenarios. For our task, the distinction is beneficial because elements in the function layer must be mapped to specific keywords in the CAS syntax, while elements in the arithmetic layer can be mostly ignored.

Existing MathIR tasks focus on semantically enhancing either the expression [208, 209, 215], arithmetic [93, 242, 339], or identifier [121, 279, 329, 330, 339, 400] layer, missing the important function layer entirely. An algorithm needs to understand the involved functions to identify objects in the function layer. This dilemma is usually avoided in MathIR tasks, since objects in the other layers can be extracted mostly context-independently. The meaning of arithmetic operators usually does not change (e.g., +, −, or /), and math identifiers can often be presumed to be Latin or Greek letters. The function layer, however, contains the most crucial objects for the translation task. Identifiers generally represent mutable objects, such as variables or parameters, and do not require specific mapping rules. Similarly, arithmetic operations are natively supported by most mathematical software. Finally, objects in the expression layer are often too abstract (because they are compositions of multiple objects) and cannot be mapped as a whole to a single logic procedure in a computable format.

There are approaches available that try to semantically enrich elements in the function layer. However, most of these semantic enrichment approaches focus solely on mathematical expressions themselves and do not analyze textual information [159, 259, 270, 339, 364, 374].

<sup>39</sup>Also often called *semantic enrichment*.

<sup>40</sup>For an extensive review of retrieval approaches for mathematical formulae, see also [326, Chapter 2].



Figure 2.7: Four different layers of math objects in a single mathematical expression. The red highlights in the function and arithmetic layers refer to the fixed structure (or stem) of the function or operator. Gray tokens are mutable. Elements in the arithmetic layer are generally understood without further mappings and are mostly context-independent, while elements in the function layer must be mapped to specific procedures in CAS and require disambiguation. However, a strict distinction is not always required and might even be confusing. For example, *n*! is mostly understood by CAS and is context-independent, but can (and sometimes should) be mapped to the specific factorial procedure, making it more of an element of the function layer.

Approaches that take the textual context of a formula into account, on the other hand, do not semantically enrich objects in the function layer. Instead, they focus on other specific applications, including math embeddings with the goal of a semantic vector representation [121, 215, 360, 400, 404], entity linking [208, 212, 316, 321], math word problem solving [285, 409], semantic annotation [183, 214, 279, 329, 330], and context-aware math search engines [93, 122, 124, 145, 210, 211, 232, 273, 314, 315, 366]. Regarding the translation of mathematical expressions from a lower level of semantics to a higher level, the relevant literature is limited. The most relevant related literature for our task includes semantic tagging [71, 402], annotations [139, 183, 214, 279, 329, 330], and term disambiguation [339]. In the following, we distinguish semantic tagging (the task of precisely tagging math objects with a predefined set of semantic tags) from semantic annotation (the task of adding any number of relevant descriptions to math objects).

**Semantic Tagging and Term Disambiguation** Semantic tagging of mathematical tokens has rarely been studied in the past and has not yet reached a well-established level of reliability. To the best of our knowledge, only Chien et al. [71] (2015) and Youssef [402] (2017) addressed the semantic tokenization of math formulae. Youssef [402] created the POM tagger, which tags tokens in the LaTeX parse tree with additional information from a manually crafted lexicon. The POM tagger is still a work in progress and does not perform disambiguation steps yet. In the future, it is planned to reduce the number of possible tags for a token by analyzing the textual context and eliminating false tags. Ideally, the extracted context information results in a single, unique tag for each token. However, no update of the POM tagger, including the disambiguation steps, has been published so far. Recently, however, Shan and Youssef [339] presented several machine learning approaches as the first step towards the disambiguation of mathematical terms. They trained different models on the semantic DLMF dataset and successfully disambiguated


prime notations with an F1 score of 0.83. However, whether the models merely adapted to the relatively strict DLMF notation style for primes, or whether they are also able to disambiguate other real-world data, has not been discussed.

Chien et al. [71] proposed a probabilistic model over entire document collections to infer semantic tags of mathematical tokens. They focused on tagging single identifiers (i.e., no groups of tokens). They established that the *consistency property* and *user habits* are critical aspects for successful tag disambiguation. With *user habits*, the authors referred to the different education levels and expertise of users, so that a model can predict the preferred notation for specific semantics. The *consistency property* refers to the assumption that the meaning of a single term does not change within a certain context, e.g., a document. Recent efforts on annotating mathematical symbols by Asakura et al. [1], however, indicate that the scope of consistent tags could be significantly smaller than an entire document or a document collection. The semantics of frequently used symbols, such as *x* or *t*, may even change within single paragraphs. Another interesting counterexample is the connection between Euler numbers and Euler polynomials [98, (24.2.9)] in

$$E_n = 2^n E_n \left(\frac{1}{2}\right). \tag{2.3}$$

While clearly connected, the first *E* refers to the Euler numbers while the second *E* refers to the Euler polynomials. This underlines that, under special circumstances, even within the scope of a single equation, an identifier may refer to two different mathematical concepts. Chien et al. reported a maximum accuracy of 0.94.

**Semantic Annotation Task** While the task of semantic annotation has been studied more comprehensively, none of the existing approaches tried to convert the source expressions into a computable format [139, 183, 214, 279, 329, 330]. Grigore et al. [139], Nghiem et al. [269], Pagel et al. [279], Schubotz et al. [329, 330], and Kristianto et al. [214] analyze nouns or noun phrases in the surrounding context of a formula to semantically annotate an entire expression or parts of an expression. Only Grigore et al. [139] tried to use this information to perform a translation to a semantically enhanced format, in this case content MathML. The authors deduced a CD entry for a math symbol by calculating the similarity between the nouns surrounding the symbol and the textual description (or more precisely: the cluster of nouns in that description) of the CD entry. They measured the similarity with distributional properties from WordNet [261]. The other approaches either use the gained semantic information to improve search engines [214, 269] or to enable entity linking [279, 329, 330]. While other semantification approaches exist that elevate source presentational formats to a semantically enriched format [245, 251, 257, 270, 271, 364, 391], none of them takes the textual context into account. Some of them, however, perform disambiguation steps by considering other mathematical expressions in the same document (again presuming a semantic consistency of math notation within a single document, as proposed by Chien et al. [71]) [270, 271]. None of the previous work considered the possibility of an identifier that has multiple meanings within a single formula, as shown in equation (2.3).

**Summary** In summary, existing semantic enrichment approaches avoid the essential function layer [159, 259, 270, 364, 374], ignore the textual context surrounding a formula [71, 245, 251, 257, 270, 271, 296, 364, 391], or do not use the extracted information for a translation towards a semantically enhanced format [183, 214, 279, 329, 330, 402]. Nonetheless, the related work underlines the benefits of analyzing the textual context of a formula. More importantly, the research has shown that even simple noun phrase extraction provides viable information for numerous applications [139, 183, 214, 279, 329, 330]. This motivated us to apply these promising approaches in our semantification pipeline too.

Regarding the final translations towards computable formats, our comprehensive analysis of LaTeX to MathML conversion tools in the previous section revealed that we would probably gain no benefit from translating LaTeX to MathML in an intermediate step. While many CAS provide import functions for MathML, there is no substantial support for OpenMath CDs. Another option would be OpenMath, since the SCSCP protocol uses OpenMath for inter-CAS communications. However, SCSCP is relatively complex for our task and difficult to extend to new CAS if we do not have access to the internal libraries. Additionally, there are no translation tools from LaTeX to OpenMath, even though LaTeXML can be exploited to realize rule-based translations.

In a previous research project, we developed LACAST, a semantic LaTeX to CAS translator, specifically for the DLMF [3, 13]. The goal of LACAST was to translate DLMF formulae, given in semantic LaTeX, to the CAS Maple. The semantic LaTeX macros reduced the ambiguity in mathematical expressions and enabled LACAST to focus on other translation issues, such as definition disparities between the DLMF and Maple. Hence, we have already established a reliable and expandable translation pipeline from semantic LaTeX to Maple. As a consequence, in this thesis we focus our efforts on the more promising semantification of LaTeX to semantic LaTeX rather than from LaTeX to content MathML<sup>41</sup>.

This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

<sup>41</sup>Since the original development of LACAST was part of my Master's thesis, the content of the associated early publications [3, 13] is not reused in this thesis. For more details about LACAST, see [13].

*You must strive to find your own voice because the longer you wait to begin, the less likely you are going to find it at all.*

John Keating - *Dead Poets Society*

### **CHAPTER 3**

### **Semantification of Mathematical LaTeX**



In this chapter, we focus on research task **II**, i.e., we develop a new semantification process that addresses the issues of existing approaches outlined in the previous chapter. We identified two main issues with existing MathIR approaches for the disambiguation and semantification of LaTeX expressions. First, many semantification approaches solely focus on single tokens, such as identifiers, or on the entire mathematical expression, but fail to semantically enrich the essential subexpressions between both extremes. Second, existing translation approaches lack context sensitivity and disambiguate expressions by following an internal (often hidden) context-agnostic decision process. This chapter addresses these issues in three parts. First, we elaborate on the capabilities of word embedding techniques to semantically enrich mathematical expressions. Second, we study the frequency distribution of mathematical subexpressions in scientific corpora to better understand the variety and complexity of subexpressions. Third, we briefly outline a context-sensitive translation pipeline based on the knowledge gained from the first two parts.

The primary goal of this chapter is to develop a context-sensitive LaTeX to CAS translation pipeline. Unfortunately, it is not clear where we can find sufficient semantic information in the context to perform reliable translations. We can expect a certain amount of information to be included in the given expression itself [54, 71, 394]. Additionally, related work has proven that noun phrases in the nearby textual context (such as the sentences preceding or following a formula) can successfully disambiguate math formulae [139, 209, 213, 329]. However, many functions are not necessarily declared in the surrounding context because the author presumes the interpretation is unambiguous. Wolska and Grigore [394] have shown that only around 70% of mathematical identifiers are explicitly declared in the surrounding context. In this case, the location of the information that disambiguates the expression may vary greatly depending on many factors, such as the expected education level of the target audience of the article, the given references in the document, or even the author's preferred notation style. One possible solution for exploiting this source of semantic information is to build a common knowledge database for mathematical expressions.

As a first attempt to automatically build such a common knowledge database that stores the standard, i.e., most common, meanings of mathematical symbols, we explore the capabilities of machine learning algorithms in the first part of this chapter. Specifically, we use word embeddings to train on common co-occurrences of mathematical and natural language tokens. We will show that this approach is not as successful as we had hoped for our knowledge extraction task but enables new approaches for mathematical search engines. Further, the results once again underline the issues with the interpretation of nested mathematical objects. Word embeddings for mathematical tokens are largely unable to properly learn the connections with defining expressions in the context because they still ignore the function layer of mathematical expressions. Subsequently, we focus our studies on mathematical subexpressions.

As a thought experiment, consider mathematical expressions to be like entire sentences in natural languages rather than single words. Following this analogy, entire math terms are analogous to words, and the notation of mathematical expressions certainly follows a specific grammar [54]. However, our *mathematical sentences* have one distinct difference compared to natural language sentences: the grammar of mathematical expressions is built around a nested structure, in contrast to the sequential order of words. For example, a math term representing a variable is a placeholder and can be replaced with arbitrarily complex and deeply nested subexpressions without violating any grammatical rules. This nested structure makes the semantic tokenization of mathematical expressions a complex and eventually context-dependent task [71, 402]. In order to review our analogy, we perform the most extensive notation analysis of mathematical subexpressions (since those are the potential words) on two real-world scientific datasets. We discovered that the frequency distributions of mathematical objects obey Zipf's law, similar to words in natural language corpora. In turn, we can use frequency-based retrieval functions to distinguish important or informative mathematical objects from stop-word-like structures. We coin these essential and informative objects Mathematical Objects of Interest (MOI). The success of this new interpretation finally motivated us to move away from the established MathIR techniques that focus on single identifiers or entire math expressions towards meaningful subexpressions. Hence, we conclude this chapter with an abstract context-sensitive translation approach that finally accounts for the nested grammar of mathematical formulae and is based on the new concept of MOI.
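The frequency-based view can be sketched in a few lines: counting subexpression occurrences across documents and scoring them with an idf-like retrieval function separates ubiquitous, stop-word-like tokens from rarer, informative ones. The toy corpus and the flat subexpression lists below are illustrative assumptions; the actual study operates on parsed expression trees of large corpora.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of subexpressions (MOI candidates).
documents = [
    ["x", "x^2", "f(x)"],
    ["x", "n", "x^2"],
    ["x", "f(x)", "\\Gamma(z)"],
    ["x", "n", "\\Gamma(z)"],
]

def frequency_ranking(docs):
    """Rank subexpressions by raw corpus frequency (most frequent first)."""
    return Counter(tok for doc in docs for tok in doc).most_common()

def idf(term, docs):
    """Inverse document frequency: ~0 for stop-word-like tokens such as 'x',
    higher for rarer, more informative subexpressions."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)
```

Here `idf("x", documents)` is 0 because *x* occurs in every document, while `idf("x^2", documents)` is positive, mirroring how frequency-based retrieval functions demote stop-word-like structures and promote MOI candidates.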

In summary, this chapter is organized as follows. In Section 3.1, we explore the capabilities of word embeddings to discover common co-occurrences of natural language tokens and math tokens in large scientific datasets. In Section 3.2, we introduce the new concept of MOI and perform the first extensive frequency distribution study of mathematical notations in two large scientific corpora. Section 3.3 concludes the findings of the previous sections by introducing a novel context-sensitive translation approach from LaTeX to CAS expressions. Section 3.1 was published as an article in the Scientometrics journal [15]. Section 3.2 was published as a full paper at the WWW conference [14]. Excerpts of Section 3.3 have been published in a full paper at the ICMS conference [10].

### **3.1 Semantification via Math-Word Embeddings**

Mathematics is capable of explaining complicated concepts and relations in a compact, precise, and accurate way. Learning this idiom takes time and is often difficult, even for humans. The general applicability of mathematics allows a certain level of ambiguity in its expressions. Short explanations or mathematical expressions that serve as context to the reader are often used to mitigate this ambiguity problem. Along with context-dependency, inherent issues of linguistics (e.g., ambiguity, non-formality) make it even more challenging for computers to understand mathematical expressions. Nevertheless, a system capable of automatically capturing the semantics of mathematical expressions would be suitable for improving several applications, from search engines to recommendation systems. Word embedding [33, 34, 43, 65, 73, 217, 222, 239, 250, 255, 272, 293, 295] has made it possible to apply deep learning in NLP with great effect. That is because an embedding represents individual words with numerical vectors that capture the contextual and relational semantics of the words. Such a representation enables inputting words and sentences to a Neural Network (NN) in numerical form. This allows training NNs and using them as predictive models for various NLP tasks and applications, such as semantic role modeling [149, 412], word-sense disambiguation [160, 305], sentence classification [186], sentiment analysis [344], coreference resolution [223, 388], named entity recognition [72], reading comprehension [75], question answering [234], natural language inference [69, 137], and machine translation [97]. The performance of word embedding in NLP tasks has been measured and shown to deliver fairly high accuracy [256, 293, 295].

Since math text consists of natural text as well as math expressions that exhibit linear and contextual correlation characteristics very similar to those of natural sentences, word embedding applies to math text much as it does to natural text. Accordingly, it is worthwhile to explore the use and effectiveness of word embedding in Mathematical Language Processing (MLP), Mathematical Knowledge Management (MKM), and MathIR. Still, math expressions and math writing styles differ from natural text to the point that NLP techniques have to undergo significant adaptations and modifications to work well in math contexts.

While some efforts have started to apply word embedding to MLP, such as equation embedding [9, 121, 215, 400, 404], there is a healthy skepticism about the use of ML and Deep Learning


(DL) in MLP and MKM, on the basis that much work is still required to prove the effectiveness of DL in MLP. Learning how to adapt and apply DL in the MLP/MKM/MathIR context is not an easy task. Most applications of DL in MLP/MKM/MathIR rest on the effectiveness of word/math-term embedding (henceforth *math embedding*) because the latter is the most basic foundation of language DL. Therefore, it behooves us to start looking at the effectiveness of math embedding in basic tasks, such as term similarity, analogy, information retrieval, and basic math search, to learn more about its extensions and limitations. More importantly, we need to learn how to refine and evolve math embedding to become accurate enough for more demanding applications, such as knowledge extraction. That is the primary objective of this section.

To that effect, there is a fundamental need for datasets and benchmarks, preferably standard ones, that allow researchers to measure the performance of various math embedding techniques, and applications based on them, in an objective and statistically significant way, and to measure improvements and comparative progress. Such resources are abundant in the natural language domain but scarce in the MLP domain. Developing some such datasets and benchmarks will hopefully form the nucleus for further development by the community to facilitate research and speed up progress in this vital area of research.

While the task of creating such resources for DL applications in MLP can be long and demanding, the examination of math embedding should not wait but should proceed right away, even if in an exploratory manner. Early evaluations of math embedding should ascertain its value for MLP/MKM/MathIR and inform the process and trajectory of creating the corpora and benchmarks. Admittedly, until adequate datasets and benchmarks become available for MLP, we have to resort to less systematic performance evaluations and rely on preliminary tests on the limited resources available. The DLMF [98] and the arXiv.org preprint archive<sup>1</sup> are good resources to start our exploratory embedding efforts. The DLMF offers high quality, and the authors are familiar with its structure and content (which aids in crafting some of the tests). As for the arXiv collection, its large volume of mostly math articles makes it an option worth investigating as well.

In this section, we provide an exploratory investigation of the effectiveness and use of word embedding in MLP and MKM through different perspectives. First, we train word2vec models on the DLMF and arXiv with slightly different approaches for embedding math. Since the DLMF is primarily a handbook of mathematical equations, it does not provide extensive textual content. We will show that the DLMF-trained model is appropriate for discovering mathematical term similarities and term analogies, and for generating query expansions. We hypothesize that the arXiv-trained models are beneficial for extracting definiens, i.e., textual descriptive phrases for math terms. We examine the possible reasons why the word embedding models trained on the arXiv dataset do not produce valuable results for this task. Besides, we discuss some of the reasons that we believe thwart the progress of MathIR in the direction of machine learning. In summary, we focus on five tasks: (i) term similarity, (ii) math analogies, (iii) concept modeling, (iv) query expansion, and (v) knowledge extraction. In the context of this thesis, we are mostly interested in the latter, i.e., knowledge extraction, and will solely focus on these experiments and results. For tasks (i-iv), see [15].

<sup>1</sup> https://arxiv.org/ [accessed 2019-09-01]

### **3.1.1 Foundations and Related Work**

Understanding a mathematical expression essentially means comprehending the semantic value of its internal components, which can be accomplished by linking its elements with their corresponding mathematical definitions. Current MathIR approaches [213, 329, 330] try to extract textual descriptors of the parts that compose mathematical equations. Intuitively, two questions arise from this scenario: (i) how to determine the parts which have their own descriptors, and (ii) how to identify correct descriptors over others.

Answers to (i) are more concerned with choosing correct definitions for which parts of a mathematical expression are considered as one mathematical object [197, 18, 402]. Current definition languages, such as the content MathML 3.0<sup>2</sup> specification, are often imprecise<sup>3</sup>. For example, content MathML 3.0 uses 'csymbol' elements for functions and specifies them as expressions that *refer to a specific, mathematically-defined concept with an external definition*<sup>4</sup>. However, in the case of the Van der Waerden number, for instance, it is not clear whether *W* or the sequence *W*(*r, k*) should be declared as a 'csymbol'. Another example involves content identifiers, which MathML specifies as *mathematical variables that have properties, but no fixed value*<sup>5</sup>. While content identifiers are allowed to have complex rendered structures (e.g., *β*<sup>2</sup><sub>*i*</sub>), it is not permitted to enclose identifiers within other identifiers. Let us consider *α*<sub>*i*</sub>, where *α* is a vector and *α*<sub>*i*</sub> its *i*-th element. In this case, *α*<sub>*i*</sub> should be considered a composition of three content identifiers, each one carrying its own individualized semantic information, namely the vector *α*, the element *α*<sub>*i*</sub> of the vector, and the index *i*. However, with the current specification, the definition of these identifiers would not be canonical. One possible workaround to represent such expressions with content MathML is to use a structure of four nodes, interpreting *α*<sub>*i*</sub> as a function via a 'csymbol' (one parent 'apply' node with the three children *vector-selector*, *α*, and *i*). However, ML algorithms and MathIR approaches would benefit from more precise definitions and a unified answer to (i). Most of the related work relies on these relatively vague definitions and on the analysis of content identifiers, focusing their efforts on (ii).

Questions (i), (ii), and other pragmatic issues are already under discussion in a bigger context, as data production continues to rise and digital repositories seem to be the future of any archive structure. A prominent example is the National Research Council's effort to establish what they call the Digital Mathematical Library (DML)<sup>6</sup>, a project under the International Mathematical Union. The goal of this project is to take advantage of new technologies and help to solve the inability to search, relate, and aggregate information about mathematical expressions in documents over the web.

The advances most relevant to our work are the recent developments in *word embedding* [43, 65, 73, 256, 293, 295, 313]. Word embedding takes as input a text collection and generates a numerical feature vector (typically with 100 or 300 dimensions) for each word in the collection. This vector captures latent semantics of a word from the contexts of its occurrences in the

<sup>2</sup> https://www.w3.org/TR/MathML3/ [accessed 2019-09-01]

<sup>3</sup> Note that OpenMath is another specification designed to encode the semantics of mathematics. However, content MathML is an encoding of OpenMath, and the inherent problems of content MathML also apply to OpenMath (see https://www.openmath.org/om-mml/ [accessed 2019-09-01]).

<sup>4</sup> https://www.w3.org/TR/MathML3/chapter4.html#contm.csymbol [accessed 2019-09-01]

<sup>5</sup> https://www.w3.org/TR/MathML3/chapter4.html#contm.ci [accessed 2019-09-01]

<sup>6</sup> https://www.nap.edu/read/18619 [accessed 2019-09-01]

collection; in particular, words that often co-occur nearby tend to have similar feature vectors (where similarity is measured by the cosine similarity, the Euclidean distance, etc.).
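Both similarity measures mentioned above can be computed directly from the feature vectors; a minimal sketch in NumPy (the vectors themselves would come from a trained embedding model):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Straight-line distance between two embedding vectors."""
    return float(np.linalg.norm(u - v))
```

Cosine similarity is the measure used throughout the experiments in this section, since it is insensitive to vector magnitude and only compares directions.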

Recently, more and more projects have tried to adapt these word embedding techniques to learn patterns of the correlations between context and mathematics. Gao et al. [121] embed single symbols and train a model that can discover similarities between mathematical symbols. Similarly, Krstovski and Blei [215] use a variation of word embedding to represent complex mathematical expressions as single unit tokens for IR. In 2019, Yasunaga and Lafferty [400] explored an embedding technique based on recurrent neural networks to improve topic models by considering mathematical expressions. They state that their approach outperforms topic models that do not consider mathematics in text and report a topic coherence improvement of 0.012 over the LDA<sup>7</sup> baseline. Equation embeddings, as in [121, 215, 400], present promising results for identifying similar equations and contextual descriptive keywords. In the following, we will explore different techniques of word embedding in more detail.

### **3.1.1.1 Word Embedding**


In this section, we apply *word2vec* [256] to the DLMF [98] and to the collection of arXiv documents to generate embedding vectors for various math symbols and terms. The word2vec technique computes real-valued vectors for words in a document using two main approaches: skip-gram and continuous bag-of-words (CBOW). Both produce a fixed-length *n*-dimensional vector representation for each word in a corpus. In the skip-gram training model, one tries to predict the context of a given word, while CBOW predicts a target word given its context. In word2vec, context is defined as the adjacent neighboring words in a defined range, called a sliding window. The main idea is that words appearing in similar contexts should be represented by numerical vectors with close values, often illustrated by the *king-queen relationship*.

### **King-Queen Relationship of Word-Embedding Vectors**

The king-queen relationship describes the similarity (in terms of the cosine distance between the vectors) of:

$$
\vec{v}_{\text{king}} - \vec{v}_{\text{man}} \approx \vec{v}_{\text{queen}} - \vec{v}_{\text{woman}}, \tag{3.1}
$$

where $\vec{v}_t$ represents the vector for the token *t*.
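Equation (3.1) can be illustrated with hand-crafted toy vectors. The two dimensions below (roughly "royalty" and "gender") and all numeric values are assumptions for illustration only, not trained embeddings:

```python
import numpy as np

# Toy 2-d embedding space: axis 0 encodes "royalty", axis 1 encodes "gender".
vectors = {
    "king":  np.array([0.9,  0.7]),
    "man":   np.array([0.1,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "woman": np.array([0.1, -0.7]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a: str, b: str, c: str) -> str:
    """Answer 'a is to b as ? is to c' by ranking all other tokens
    against the offset vector v_a - v_b + v_c via cosine similarity."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [t for t in vectors if t not in (a, b, c)]
    return max(candidates, key=lambda t: cosine(vectors[t], target))
```

With these toy vectors, `analogy("king", "man", "woman")` resolves to `"queen"`, which is exactly the vector arithmetic word2vec implementations expose as analogy queries.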

Extending word2vec's approaches, Le and Mikolov [222] propose *Paragraph Vectors*, a framework that learns continuous distributed vector representations for text segments of any size (e.g., sentences, paragraphs, documents). This technique alleviates the inability of word2vec to embed documents as a single entity. It also comes in two distinct variations: *Distributed Memory* and *Distributed Bag-of-Words*, which are analogous to the skip-gram and CBOW training models, respectively.

Other approaches also produce word embeddings given a training corpus as input, such as fastText [43], ELMo [295], and GloVe [293]. The choice of word2vec for our experiments is justified by its ease of implementation, its training speed using modest computing resources,

<sup>7</sup> Latent Dirichlet Allocation

its general applicability, and its robustness in several NLP tasks [160, 161, 229, 238, 302, 312]. Additionally, fastText learns word representations as a sum of the *n*-grams of their constituent characters (sub-words). The sub-word structure would introduce a certain noise<sup>8</sup> into our experiments. ELMo computes its word vectors as the average of their character representations, which are obtained through a two-layer bidirectional language model (biLM). This would bring even more granularity than fastText, as each character in a word has its own *n*-dimensional vector representation. Another factor that prevents us from using ELMo, for now, is its expensive training process<sup>9</sup>. Closer to the word2vec technique, GloVe [293] was also considered, but its co-occurrence matrix would escalate memory usage, making its training on arXiv not possible at the moment. We also examined the recently published Universal Sentence Encoder [65] from Google, but its implementation does not allow one to use a new training corpus, only to access its pre-calculated vectors based on words. We also considered BERT [96], with its recent advances of Transformer-based architectures in NLP, as an alternative to *word2vec*. However, incorporating BERT and other Transformer-based architectures would require a significant restructuring of the core idea of our work. BERT is pre-trained on two general tasks that are not directly transferable to mathematics embeddings: *Masked Language Modelling* and *Next Sentence Prediction*. Since this work is an exploratory investigation of the potential of word embedding techniques in MLP and MKM, we gave preference to tools that could be applied directly. Nonetheless, since some of our results are promising, we plan to include Transformer-based systems, such as BERT [96], XLNet [399], RoBERTa [235], and Transformer-XL [87], in future work.
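To make the sub-word concern concrete, fastText-style character n-grams can be sketched as follows (a simplification: real fastText additionally hashes the n-grams into buckets and includes the full word as its own token):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams of a word, using '<' and '>' as boundary markers,
    as in fastText. A word vector is then the sum of its n-gram vectors."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]
```

For a math token such as `alpha`, n-grams like `<al` or `pha>` carry no mathematical meaning; this is precisely the noise that makes sub-word models a poor fit for embedding mathematical identifiers.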

The overall performance of word embedding algorithms has shown superior results in many different NLP tasks, such as machine translation [256], relation similarity [161], word sense disambiguation [55], word similarity [268, 312], and topic categorization [301]. In the same direction, we also explore how well mathematical tokens can be embedded according to their semantic information. However, mathematical formulae are highly ambiguous and, if not properly processed, their representation is jeopardized.

To investigate the situations described in Sections 3.1.1.1 and 2.2.5, we applied word2vec in two different scenarios, one focusing on MathIR (DLMF) and the other on semantic knowledge extraction (arXiv), i.e., identifying definiens for math objects. To summarize our decisions: for the DLMF and arXiv, we chose the stream-of-tokens embedding technique, i.e., each inner token is represented as a single *n*-dimensional vector in the embedding model. For the DLMF, we embed all inner tokens, while for arXiv, we only embed the identifiers. In this thesis, we are more interested in applying math embeddings to the semantic extraction task. The MathIR task is described in [15, Section 3].

### **3.1.2 Semantic Knowledge Extraction**

Extracting definiens of mathematical objects from a textual context is a common task in MathIR [214, 279, 329, 330, 405] that often provides a gold dataset for its evaluation. Since the DLMF does not provide extensive textual information for its mathematical expressions, we considered an alternative scenario in our analysis, one in which we trained a second word2vec model on a much larger corpus composed of articles from the arXiv collection. In this section, we compare our findings against the approach by Schubotz et al. [330]. We apply variations of a word2vec [256] and paragraph vectors [222] implementation to extract mathematical relations from the arXMLiv 2018 [132] dataset (i.e., an HTML collection of the arXiv.org preprint archive<sup>10</sup>), which is used as our training corpus. We consider only the subsets that did not report errors during the document conversion (i.e., *no\_problem* and *warning*), which represent 70% of arXiv.org. We make the code regarding our experiments publicly available<sup>11</sup>.

<sup>8</sup> Noise means the data consists of many uninteresting tokens that affect the trained model negatively.

<sup>9</sup> https://github.com/allenai/bilm-tf [accessed 2019-09-01]

### **3.1.2.1 Evaluation of Math-Embedding-Based Knowledge Extraction**

As a pre-processing step, we represent mathematical expressions using the MathML<sup>12</sup> notation. First, we replace each mathematical expression with the sequence of identifiers it contains, e.g., *W*(2*, k*) is replaced by '*W k*'. We also add the prefix 'math-' to all identifier tokens to later distinguish between textual and mathematical terms. Second, we remove all common English stop words from the training corpus. Finally, we train a word2vec model (skip-gram) using the following hyperparameters<sup>13</sup>: a vector size of 300 dimensions, a window size of 15, a minimum word count of 10, and a negative sampling of 1e-5. We justify the hyperparameters used in our experiments based on previous publications using similar models [63, 221, 222, 255, 312].
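The first two pre-processing steps can be sketched as follows. The inline `$...$` formula markup, the regular expressions, and the stop-word list are simplified assumptions for illustration; the actual pipeline operates on the MathML annotations of the arXMLiv documents:

```python
import re

# Simplified stop-word list (the real pipeline uses a full English list).
STOPWORDS = {"the", "a", "an", "of", "is", "are", "in", "by", "and"}

def extract_identifiers(expr: str) -> list[str]:
    """Keep only the letter identifiers of a formula, e.g. 'W(2, k)' -> ['W', 'k']."""
    return re.findall(r"[A-Za-z]", re.sub(r"\d", "", expr))

def preprocess(sentence: str) -> list[str]:
    """Replace $...$ formulae by their 'math-'-prefixed identifier sequence
    and drop common English stop words from the remaining text."""
    tokens = []
    for part in re.split(r"(\$[^$]*\$)", sentence):
        if part.startswith("$"):
            tokens += ["math-" + t for t in extract_identifiers(part[1:-1])]
        else:
            tokens += [w for w in part.lower().split() if w not in STOPWORDS]
    return tokens
```

For example, `preprocess("The function $W(2, k)$ is bounded")` yields `['function', 'math-W', 'math-k', 'bounded']`, which is the token stream the word2vec model is then trained on.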

In the following, distances between vectors are calculated via the cosine distance. The trained model was able to partially incorporate the semantics of mathematical identifiers. For instance, the 27 closest vectors to the mathematical identifier *f* are mathematical identifiers themselves, and the fourth closest noun vector to *f* is '*function*'. We observe that the results of the model trained on arXiv are comparable with our previous experiments on the DLMF.

Previously, we used the semantic relations between embedding vectors to search for relevant terms in the model. Hereafter, we will refer to this algebraic property as the *semantic distance* to a given term with respect to a given relation, i.e., two related vectors. For example, to answer the query: What is to 'complex' as *x* is to 'real'?, one has to find the closest *semantic vectors* to 'complex' with respect to the relation between *x* and 'real', i.e., find the vector $\vec{v}$ in

$$
\vec{v} - \vec{v}_{\text{complex}} \approx \vec{v}_x - \vec{v}_{\text{real}}.
$$

Instead of asking for mathematical expressions, we will now reword the query to ask for specific words. For example, to retrieve the meaning of *f* from the model, we can ask: What is to *f* as 'variable' is to *x*? Or, in other words, what is semantically close to *f* with respect to the relation between 'variable' and *x*? Table 3.1 shows the top 10 semantically closest results to *f* with respect to the relations between $\vec{v}_{\text{variable}}$ and $\vec{v}_x$, $\vec{v}_{\text{variable}}$ and $\vec{v}_y$, and $\vec{v}_{\text{variable}}$ and $\vec{v}_a$.
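On toy vectors, such an analogy lookup can be sketched as follows; the four-dimensional embeddings and the five-word vocabulary are invented for illustration and are not values from our trained model.

```python
import numpy as np

# Toy embeddings (hypothetical values chosen for illustration only).
emb = {
    "math-f":   np.array([0.9, 0.1, 0.8, 0.0]),
    "math-x":   np.array([0.8, 0.2, 0.1, 0.0]),
    "variable": np.array([0.1, 0.9, 0.1, 0.0]),
    "function": np.array([0.2, 0.8, 0.8, 0.0]),
    "matrix":   np.array([0.1, 0.2, 0.1, 0.9]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(b, c, a):
    """Find v such that v - a ≈ b - c, i.e. the word closest to a + b - c."""
    target = emb[a] + emb[b] - emb[c]
    candidates = {w: cos(v, target) for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

# "What is to f as 'variable' is to x?"
answer = analogy("variable", "math-x", "math-f")
print(answer)  # → function
```

In the actual experiments, the same arithmetic is applied to the 300-dimensional word2vec vectors over the full vocabulary.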

From Table 3.1, we can observe a similar behaviour. Later, we will see that mathematical vectors form a cluster in the trained model, i.e., that the vectors $\vec{v}_f$, $\vec{v}_x$, and $\vec{v}_y$ are close to each other with respect to the cosine similarity. This cluster, and the fact that we did not use stemming or lemmatization during pre-processing, explains why the top hit for these queries is always 'variables'. To refine the order of the extracted answers, we calculated the cosine similarity between $\vec{v}_f$ and the vectors of the extracted words directly. Table 3.2 shows the cosine distances between $\vec{v}_f$ and the words extracted by the query: *Term* is to *f* what 'variable' is to *a*.

<sup>10</sup>https://arxiv.org/ [accessed 2019-09-01]

<sup>11</sup>https://github.com/ag-gipp/math2vec [accessed 2019-09-01]

<sup>12</sup> The source TeX file has to use mathematical environments for its expressions.

<sup>13</sup> Hyperparameters not mentioned here are used with their default values as described in the Gensim API [307].


Table 3.1: Analogies of the form: Find the *Term* where *Term* is a word that is to X what Y is to Z.

Asking for the meaning of *f* is a very generic question. Thus, we performed a detailed evaluation on the first 100 entries<sup>14</sup> of the MathMLben benchmark [18]. We evaluated the average of the *semantic distances* with respect to the relations between $\vec{v}_{\text{variable}}$ and $\vec{v}_x$, $\vec{v}_{\text{variable}}$ and $\vec{v}_a$, and $\vec{v}_{\text{function}}$ and $\vec{v}_f$. We chose to test on these relations because we believe they are the most general ones that are still applicable, as seen, e.g., in Table 3.2. In addition, we consider only results with a cosine similarity of 0.70 or greater to maintain a minimum quality in our experiments. The overall results were poor, with a precision of *p* = 0.0023 and a recall of *r* = 0.052. Despite the weak results, an investigation of some specific examples revealed interesting characteristics; for example, for the identifier *W*, the four semantically closest results were *functions*, *variables*, *form*, and the mathematical identifier *q*. The poor performance illustrates that there might be underlying issues with our approach. However, as mentioned before, mathematical notation is highly flexible and context-dependent. Hence, in the next section, we explore a technique that reorders the hits according to the immediate context of the mathematical expression.

### **3.1.2.2 Improvement by Considering the Context**

We also investigate how a different word embedding technique would affect our experiments. To do so, we trained a Distributed Bag-of-Words of Paragraph Vectors (DBOW-PV) [222] model. We trained this DBOW-PV model on the same corpus as our word2vec model (with the same pre-processing steps) and the following configuration: 400 dimensions, a window size of 25, and a minimum word count of 10. Schubotz et al. [330] analyze all occurrences of mathematical identifiers and consider the entire article at once. We believe this prevents the algorithm from finding the right descriptor in the text, since later or prior occurrences of an identifier might appear in a different context and potentially introduce different meanings. Instead of using the entire document, we apply the algorithm by Schubotz et al. [330] only to the input paragraph and


<sup>14</sup>Same entries used in [330]


Table 3.2: The cosine distances of *f* with respect to the hits in Table 3.1.

similar paragraphs given by our DBOW-PV model. Unfortunately, the variance within the retrieved paragraphs introduces a high number of false positives to the list of candidates, which negatively affects the performance of the original approach.

As a second approach for improving our system, we considered a given textual context to reorder the extracted words according to their cosine similarities to that context. For example, consider the sentence: 'Let *f*(*x, y*) be a continuous function where *x* and *y* are arbitrary values.' We ask for the meaning of *f* with respect to this given context sentence. The top-k closest words to *f* in the word2vec model only represent the distance over the entire corpus, in this case arXiv, but not with regard to a given context. To address this issue, we retrieved paragraphs similar to our context example via the DBOW-PV model and computed the weighted average distance between all top-k words similar to *f* and the retrieved sentences. We expected that the word describing *f* in our example sentence would also exhibit a higher cosine similarity to the context itself. Table 3.3 shows the top-10 closest words (i.e., we filtered out other math tokens) and their cosine similarities to *f* in the left column. The right column shows the average cosine similarities of the extracted words to the example context sentence and its retrieved similar sentences.
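The reordering step can be sketched with plain cosine arithmetic. All vectors and weights below are invented for illustration; in the experiment, the candidate words come from word2vec and the context vectors from the DBOW-PV model.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical candidate words close to f, with toy embedding vectors.
candidates = {
    "function":   np.array([0.2, 0.8, 0.8]),
    "polynomial": np.array([0.1, 0.6, 0.9]),
    "variable":   np.array([0.8, 0.3, 0.1]),
}
# Toy vectors: the given context sentence plus two DBOW-PV-retrieved paragraphs.
contexts = [np.array([0.3, 0.9, 0.7]),
            np.array([0.2, 0.7, 0.9]),
            np.array([0.4, 0.8, 0.6])]
# Weighted average: the given context sentence counts more than retrieved ones.
weights = np.array([0.5, 0.25, 0.25])

def context_score(word_vec):
    sims = np.array([cos(word_vec, c) for c in contexts])
    return float(weights @ sims)

ranked = sorted(candidates, key=lambda w: context_score(candidates[w]), reverse=True)
print(ranked)  # → ['function', 'polynomial', 'variable']
```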

As Table 3.3 illustrates, this context-sensitive approach was not beneficial; in fact, it undermined our model. Since the identifier should be closer to the given context sentence than to the related sentences retrieved from the DBOW-PV model, we also explored the use of a weighted average. However, the weighted average did not improve on the results of the plain average. We also tested other hyperparameters for the word embedding models in an attempt to tune our system, but could not observe any drastic changes in the measured performance.

### **3.1.2.3 Visualizing Our Model**

Figure 3.1 illustrates four t-SNE [154] plots of our word2vec model. Since t-SNE plots may produce misleading structures [382], we plot four t-SNE plots with different perplexity values.

Table 3.3: We are looking for descriptive terms for *f* in a given context: '*Let f*(*x, y*) *be a continuous function where x and y are arbitrary values*'. To achieve this, we retrieved close vectors to *f* and computed their distances to the given context sentence. To add variety to the context, we used our DBOW-PV model to retrieve sentences related to the given context and computed the average distance of the words to these related sentences.


Other parameters were set to their default values according to the t-SNE python package. We colored word tokens in blue and math tokens in red. The plots illustrate, not surprisingly, that math tokens are clustered together. However, a certain subset of math tokens appears isolated from the other math tokens. By attaching the content to some of the vectors, we can see that math tokens such as *and* (an *and* within math mode) and *im* (most likely referring to imaginary numbers) form a second cluster of math tokens. The plot is similar to the visualized model presented in [121], even though they use a different word embedding technique. Hence, the general structure of math word2vec models seems to be insensitive to the formula embedding technique used. Compared to [121], we provide a model with richer details that reveals some dense clusters, e.g., numbers (bottom-right plot at (11, 8)) or equation labels (bottom-right plot at (−14, 0)).
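Such plots can be reproduced along the following lines; the random vectors stand in for the top-k embedding vectors, and scikit-learn's t-SNE is assumed in place of the original 't-SNE python package'.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 50 random 300-d vectors in place of real embedding vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 300))

# Project to 2-D at several perplexity values (perplexity must be < n_samples);
# all other t-SNE settings keep their defaults, as in the figure.
projections = {}
for perplexity in (5, 10, 40):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    projections[perplexity] = tsne.fit_transform(vectors)

print(projections[5].shape)  # → (50, 2)
```

Each 2-D projection can then be scattered with matplotlib, coloring word tokens blue and math tokens red.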

Based on the presented results, one can still argue that more settings should be explored for the embedding phase (e.g., different parameters and embedding techniques), that different pre-processing steps (e.g., stemming and lemmatization) should be adopted, and that post-processing techniques (e.g., boosting terms of interest based on a knowledge database such as OntoMathPro [104, 105]) should also be investigated. This would presumably solve some minor problems, such as removing the inaccurate first hit in Table 3.1. Nevertheless, the overall results would not surpass those in [330], which reports a precision score of *p* = 0.48. On the grounds that mathematics is highly customizable, many of the defined relations between mathematical concepts and their descriptors are only valid in a local scope. Consider an author who notates their algorithm using the symbol *π*. The author's specific use of *π* does not change its general use, but it affects the meaning within the scope of the article. Current ML approaches only learn patterns of the most frequently used combinations, e.g., between *f* and 'function', as seen in Table 3.1.


Figure 3.1: t-SNE plots of the top-1000 closest vectors of the identifier *f* with perplexity values 5 (top left), 10 (top right), 40 (bottom left), and 100 (bottom right), and the default values of the t-SNE python package for all other settings.

Even though math notations can change, such as *π* in the example above, one could assume the existence of a common ground for most notations. The low performance of our experiments compared to the results in [330] seems to confirm that math notations change regularly in real-world documents, i.e., are tied to a specific context. If a common ground for math notations exists, it must be marginally small, at least in the 100 test cases from [18].

### **3.1.3 On Overcoming the Issues of Knowledge Extraction Approaches**

We assume the low performance of our knowledge extraction experiments is caused by fundamental issues that should be discussed before more effort is put into training ML algorithms for extracting knowledge from math expressions. In the following, we discuss some issues and remedies that we believe can help ML algorithms understand mathematics better.

It is reported that 70% of mathematical symbols are explicitly declared in their context [394]. Only four reasons justify an explicit declaration in the context: (a) a new mathematical symbol is defined, (b) a known notation is changed, (c) the symbols used appear in other contexts and require specification to be correctly interpreted, or (d) the author's declaration is redundant (e.g., to improve readability). We assume (d) is a rare scenario compared to (a-c), except in educational literature. Current math-embedding techniques can learn semantic connections only in those 70% where the definiens is available. Besides (d), the algorithm would learn either rare notations (in case (a)) or ambiguous notations (in cases (b-c)). The flexibility that mathematical documents allow for (re)defining the notations they use further corroborates the complexity of learning mathematics.

Learning algorithms would benefit from literature focused on (a) and (d) instead of (b) and (c). Similar to students who start to learn mathematics, ML algorithms have to consider the structure of the content they learn from. It is hard to learn mathematics considering only arXiv documents, without prior or complementary knowledge. Usually, these documents represent state-of-the-art findings containing new and unusual notations and lack extensive explanations (e.g., due to page limitations). In contrast, educational books carefully and extensively explain new concepts. We assume better results can be obtained if ML algorithms are trained in multiple stages: first on educational literature, then on datasets of advanced math articles. A basic model trained on educational literature should capture standard relations between mathematical concepts and descriptors. This model should also be able to capture patterns independently of how new or unusual the notations in the literature are. In 2014, Matsuzaki et al. [247] presented some promising results for automatically answering mathematical questions from Japanese university entrance exams. While the approach involves many manual adjustments and analyses, the promising results illustrate the different levels of knowledge that are required for understanding arXiv documents vs. university entrance exams. A well-structured digital mathematical library that distinguishes the different levels of sophistication in articles (e.g., introductions vs. state-of-the-art publications) would also benefit mathematical machine learning tasks.

The lack of references and applications that provide a solid semantic structure of natural language for mathematical identifiers makes the disambiguation of the latter even more challenging. In natural texts, one can try to infer the most suitable word sense for a word based on the lemma<sup>15</sup> itself, the adjacent words, dictionaries, and thesauri, to name a few. In the mathematical arena, however, the scarcity of such resources and the flexibility of redefining identifiers make this issue much harder. The text preceding or following a mathematical equation is essential for its understanding. This context may appear a long or short distance away from the equation, which aggravates the problem. Thus, a comprehensive annotated dataset that addresses these needs for structural knowledge would enable further progress in MathIR with the help of ML algorithms.

Another primary source of complexity is the inherent ambiguity present in any language, especially in mathematics. A typical workaround in linguistics for such ambiguous notations is the use of lexical databases (e.g., WordNet [116, 261]) to identify the most suitable word sense for a given word. These databases allow embedding algorithms to train a vector for each semantic meaning of every token. For example, *Java* could have multiple vectors in a single model according to the different meanings of the word, e.g., the island in the south of Indonesia, the programming language, or the coffee beans. However, mathematics lacks such systems, which makes this approach infeasible at the moment. Youssef [402] proposes the use of tags, similar to PoS tags in linguistics, for tagging mathematical TeX tokens, bringing more information to the tokens considered. As a result, a lexicon containing several meanings for a large set of mathematical symbols has been developed. OntoMathPro [104, 105] aims at generating a comprehensive ontology of mathematical knowledge and, therefore, also contains


<sup>15</sup>canonical form, dictionary form, or citation form of a set of words

information about the different meanings of mathematical tokens. Such dictionaries might enable the disambiguation approaches from linguistics to be applied to mathematical embeddings in the near future.

Another issue in recent publications is the lack of standards and the scarcity of benchmarks to properly evaluate MathIR algorithms. Krstovski and Blei [215], and Yasunaga and Lafferty [400], provide an interesting perspective on the problem of mathematical embeddings. Their experiments focus on math analogies. Our findings in Section 3.2 corroborate their math-analogy results, as our experiments yield comparable results in a controlled environment. However, in the absence of a well-established benchmark, we, as well as the mentioned publications, are only able to provide incipient results. Existing datasets are often created for, and therefore limited to, specific tasks. For example, the NTCIR math tasks [21, 22, 405] or the upcoming ARQMath<sup>16</sup> task provide datasets that are specifically designed to tackle problems of mathematical search engines. The secondary task of ARQMath actually searches for math analogies. In general, a proper, common standard for interpreting the semantic structure of mathematics (see, for example, the mentioned problems with *α<sub>i</sub>* in Section 2) would be beneficial for several tasks in MathIR, such as semantic knowledge extraction.

### **3.1.4 The Future of Math Embeddings**

As we explored throughout this section, our preliminary results stress the urgent need for creating extensive math-specific benchmarks for testing math embedding techniques on math-specific tasks. To better appreciate the magnitude and dimensions of creating such benchmarks, it is instructive to look at some of those developed for NLP, whose tasks can beneficially inform and guide corresponding tasks in MLP. The NLP benchmarks include one for natural language inference [47], one for machine comprehension [306], one for semantic role modeling [281], and one for language modeling [68], to name a few. With such benchmarks, which are often *de facto* standards for the corresponding NLP tasks, the NLP research community has been able to (1) measure the performance of new techniques up to statistical significance, and (2) track progress in various NLP techniques, including deep learning for NLP, by quickly comparing the performance of new techniques to others and to the state of the art.

While our exploratory studies of term similarities, analogies, and query expansions need extensive future experimentation for statistically significant validation on large datasets and benchmarks, they show some of the promise and limitations of word embeddings in math (MLP) applications. In particular, their applicability to our desired knowledge extraction process is highly questionable. One of the main issues we encountered when embedding mathematics is the inability to model the nested semantic structure of mathematical expressions. In the following, we further explore properties of mathematical subexpressions by analyzing their frequency distributions in large datasets.

### **3.2 Semantification with Mathematical Objects of Interest**

As discussed before, math expressions often contain meaningful and important subexpressions. MathIR [141] applications could benefit from an approach that lies between the extremes of

<sup>16</sup>https://www.cs.rit.edu/~dprl/ARQMath/ [accessed 2020-02-01]

examining only individual symbols and considering an entire equation as one entity. Consider, for example, the explicit definition of the Jacobi polynomials [98, (18.5.7)]

$$P_n^{(\alpha,\beta)}(x) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)} \sum_{m=0}^{n} \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)} \left(\frac{x-1}{2}\right)^{m} \tag{3.2}$$

The *interesting* components in this equation are $P_n^{(\alpha,\beta)}(x)$ on the left-hand side and the appearance of the gamma function $\Gamma(s)$ on the right-hand side, implying a direct relationship between Jacobi polynomials and the gamma function. Considering the entire expression as a single object misses this important relationship. On the other hand, focusing on single symbols can result in the misleading interpretation of $\Gamma$ as a variable and of $\Gamma(\alpha+n+1)$ as a multiplication between $\Gamma$ and $(\alpha+n+1)$. A system capable of identifying the important components, such as $P_n^{(\alpha,\beta)}(x)$ or $\Gamma(\alpha+n+1)$, is therefore desirable. Hereafter, we define these components as Mathematical Objects of Interest (MOI) [9].

The *importance* of math objects is a somewhat imprecise notion and thus difficult to measure. Currently, not much effort has been made to identify meaningful subexpressions. Kristianto et al. [214] introduced dependency graphs between formulae. With this approach, they were able to build dependency graphs of mathematical expressions, but only if the expressions appeared as stand-alone expressions in the context. For example, if $\Gamma(\alpha+n+1)$ appears as a stand-alone expression in the context, the algorithm will declare a dependency with Equation (3.2). However, it is more likely that different forms, such as $\Gamma(s)$, appear in the context. Since this expression does not match any subexpression in Equation (3.2), the approach cannot establish a connection with $\Gamma(s)$. Kohlhase et al. [191, 193, 196] studied another approach to identify essential components in formulae. They performed eye-tracking studies to identify important areas in rendered mathematical formulae. While this is an interesting approach that allows one to learn more about how humans read and understand math, it is impractical for extensive studies.

This section presents the first extensive frequency distribution study of mathematical equations in two large scientific corpora, the e-Print archive arXiv.org (hereafter referred to as arXiv<sup>17</sup>) and the international reviewing service for pure and applied mathematics, zbMATH<sup>18</sup>. We will show that math expressions, similar to words in natural language corpora, obey Zipf's law [297] and therefore follow a *Zipfian* distribution. Related research projects have observed a relation to Zipf's law for single math symbols [71, 329]. In the context of quantitative linguistics, Zipf's law states that, given a text corpus, the frequency of any word is inversely proportional to its rank in the frequency table. Motivated by the similarity to linguistic properties, we will present a novel approach for ranking formulae by their relevance via a customized version of the ranking function BM25 [310]. We will present results that can easily be embedded in other systems in order to distinguish between common and uncommon notations within formulae. Our results lay a foundation for future research projects in MathIR.
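The Zipfian rank-frequency relation can be illustrated with a synthetic frequency table: in log-log space, a distribution obeying Zipf's law forms a straight line with slope ≈ −1. This is a didactic sketch on ideal data, not our corpus statistics.

```python
import numpy as np

# Synthetic frequency table obeying Zipf's law: frequency ∝ 1 / rank.
ranks = np.arange(1, 1001)
freqs = 1000.0 / ranks

# Fit a line in log-log space; a Zipfian distribution yields slope ≈ -1.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(round(slope, 2))  # → -1.0
```

The same fit applied to the empirical rank-frequency table of extracted subexpressions tests how closely math expressions follow Zipf's law.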

<sup>17</sup>https://arxiv.org/ [accessed 2019-09-01]

<sup>18</sup>https://zbmath.org [accessed 2019-09-01]

Fundamental knowledge of the frequency distributions of math formulae is beneficial for numerous applications in MathIR, ranging from educational purposes [341] to math recommendation systems [50], search engines [92, 274], and even automatic plagiarism detection systems [253, 254, 334]. For example, students can search for the conventional ways to write certain quantities in formulae; document preparation systems can integrate auto-completion or auto-correction services for math input; search or recommendation engines can adjust their ranking scores with respect to standard notations; and plagiarism detection systems can estimate whether two identical formulae indicate potential plagiarism or merely use the conventional notation of a particular subject area. To exemplify the applicability of our findings, we present a textual search approach to retrieve mathematical formulae. Further, we extend zbMATH's faceted search by providing facets of mathematical formulae according to a given textual search query. Lastly, we present a simple auto-completion system for math inputs as a contribution towards advancing mathematical recommendation systems. Further, we show that the results provide useful insights for plagiarism detection algorithms. We provide access to the source code, the results, and extended versions of all figures appearing in this section at https://github.com/ag-gipp/FormulaCloudData.

### **3.2.1 Related Work**

Today, mathematical search engines index formulae in a database. Much effort has been undertaken to make this process as efficient as possible in terms of precision and runtime performance [92, 181, 231, 236, 407]. The generated databases naturally contain the information required to examine the distributions of the indexed mathematical formulae. Yet, no in-depth studies of these distributions have been undertaken. Instead, math search engines focus on other aspects, such as devising novel similarity measures and improving runtime efficiency. This is because the goal of math search engines is to retrieve relevant (i.e., similar) formulae corresponding to a given search query that partially [211, 231, 274] or exclusively [92, 181, 182] contains formulae. For a fundamental study of the distributions of mathematical expressions, however, neither similarity measures nor efficient lookup or indexing are required. Thus, we use the general-purpose query language XQuery and employ the BaseX<sup>19</sup> implementation. BaseX is a free, open-source XML database engine that is fully compatible with the latest XQuery standard [140, 396]. Since our implementations rely on XQuery, we are able to switch to any other database that supports processing via XQuery.

### **3.2.2 Data Preparation**

LaTeX is the de facto standard for the preparation of academic manuscripts in the fields of mathematics and physics [129]. Since LaTeX allows for advanced customizations and even computations, it is challenging to process. For this reason, LaTeX expressions are unsuitable for an extensive distribution analysis of mathematical notations. For mathematical expressions on the web, the XML-formatted MathML<sup>20</sup> is the current standard, as specified by the World Wide Web Consortium (W3C). The tree structure and the fixed standard, i.e., the MathML tags, cannot be changed, making this data format reliable. Several available tools are able to convert LaTeX to MathML [18], and various databases are able to index XML data. Thus, for this study,

<sup>19</sup>http://basex.org/ [accessed 2019-09-01]; We used BaseX 9.2 for our experiments.

<sup>20</sup>https://www.w3.org/TR/MathML3/ [accessed 2019-09-01]

we have chosen to focus on MathML. In the following, we investigate the databases arXMLiv (08/2018) [132] and zbMATH<sup>21</sup> [333].

The arXMLiv dataset (≈1.2 million documents) contains HTML5 versions of the documents from the e-Print archive arXiv.org. The HTML5 documents were generated from the TeX sources via LaTeXML [257], which converted all mathematical expressions into MathML with parallel markup, i.e., presentation and content MathML. In this study, we only consider the subsets *no-problem* and *warning*, which generated no errors during the conversion process. Nonetheless, the generated MathML data still contains some errors or falsely annotated math. For example, we discovered several instances of affiliations and footnotes, SVG<sup>22</sup>, and other unknown tags encoded in MathML. Regarding the footnotes, we presume that authors incorrectly used mathematical environments to generate footnote or affiliation marks. We used the TeX string, provided as an attribute in the MathML data, to filter out expressions that match the string '{}^{\*}', where '\*' indicates any possible expression. In addition, we filtered out SVG and other unknown tags. We assume that these expressions were generated by mistake due to limitations of LaTeXML. The final arXiv dataset consisted of 841,008 documents that contained at least one mathematical formula, with a total of 294,151,288 mathematical expressions.
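The footnote filter can be sketched with a regular expression over the TeX attribute; the pattern below is our reading of the described rule, not the exact implementation.

```python
import re

# Match "{}^{*}": an empty base with an arbitrary superscript, which typically
# marks a footnote or affiliation symbol rather than real mathematics.
FOOTNOTE_PATTERN = re.compile(r"^\{\}\^\{.*\}$")

def is_footnote_mark(tex):
    return bool(FOOTNOTE_PATTERN.match(tex.strip()))

print(is_footnote_mark("{}^{1}"))          # → True
print(is_footnote_mark("P_n^{(a,b)}(x)"))  # → False
```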

In addition to arXiv, we investigated zbMATH, an international reviewing service that contains abstracts and reviews of articles, hereafter uniformly called abstracts, mainly from the domains of pure and applied mathematics. The abstracts in zbMATH are formatted in TeX [333]. To be able to compare arXiv and zbMATH, we manually generated MathML via LaTeXML for each mathematical formula in zbMATH and applied the same filters as for the arXiv documents. The zbMATH dataset contained 2,813,451 abstracts, of which 1,349,297 contained at least one formula. In total, the dataset contained 11,747,860 formulae. Even though the total number of formulae is smaller compared to arXiv, we hypothesize that math formulae in abstracts are particularly meaningful.

### **3.2.2.1 Data Wrangling**

Since we focus on the frequency distributions of visual expressions, we only consider pMML. Rather than normalizing the pMML data, e.g., via MathMLCan [117], which would also change the tree structure and core visual elements of the pMML, we only eliminated the attributes. These attributes are used for minor visual changes, e.g., stretched parentheses or inline limits of sums and integrals. Thus, for this first study, we preserved the core structure of the pMML data, which might provide insightful statistics for the MathML community to further cultivate the standard. After extracting all MathML expressions, filtering out falsely annotated math and SVG tags, and eliminating unnecessary attributes and annotations, the datasets required 83 GB of disk space for arXiv and 6 GB for zbMATH, respectively.
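The attribute-elimination step can be sketched with Python's standard XML library; this is a minimal illustration, not the pipeline's actual code, which operated on full documents.

```python
import xml.etree.ElementTree as ET

# Keep the presentation-MathML tree structure but drop all attributes,
# which only encode minor visual variations (e.g., stretchy parentheses).
def strip_attributes(node):
    node.attrib.clear()
    for child in node:
        strip_attributes(child)
    return node

mml = '<mrow><mo stretchy="true">(</mo><mi mathvariant="italic">x</mi><mo>)</mo></mrow>'
root = strip_attributes(ET.fromstring(mml))
print(ET.tostring(root, encoding="unicode"))
# → <mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow>
```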

Next, we indexed the data via BaseX. The indexed datasets required 143.9 GB of disk space in total (140 GB for arXiv and 3.9 GB for zbMATH). Due to the limitations<sup>23</sup> of databases in BaseX, it was necessary to split our datasets into smaller subsets. We split the datasets

<sup>21</sup>https://zbmath.org/ [accessed 2019-09-01]

<sup>22</sup>Scalable Vector Graphics

<sup>23</sup>A detailed overview of the limitations of BaseX databases can be found at http://docs.basex.org/wiki/Statistics [accessed 2019-09-01].

according to the 20 major article categories of arXiv<sup>24</sup> and the classifications of zbMATH. To increase performance, we use BaseX in a server-client environment. We experienced performance issues in BaseX when multiple clients repeatedly requested data from the same server in short intervals. We determined that the best workaround for this issue was to launch a BaseX server for each database, i.e., for each category/classification.

Mathematical expressions often consist of multiple meaningful subexpressions, which we defined as MOIs. However, without further investigation of the context, it is impossible to determine which subexpressions are meaningful. As a consequence, every equation is a potential MOI on its own and potentially consists of multiple other MOIs. For an extensive frequency distribution analysis, we aim to discover all possible mathematical objects. Hence, we split every formula into its components. Since MathML is an XML data format (essentially a tree-structured format), we define the subexpressions of an equation as the subtrees of its MathML representation.


Listing 3.1: MathML representation of $P_n^{(\alpha,\beta)}(x)$.

Listing 3.1 illustrates a Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ in pMML. The <mo> element on line 14 contains the *invisible times* UTF-8 character. By definition, the <math> element is the root element of a MathML expression. Since we cut off all elements other than pMML nodes, each <math> element has one and only one child element<sup>25</sup>. Thus, we define the child element of the <math> element as the root of the expression. Starting from this root element, we explore all subexpressions. For this study, we presume that every meaningful mathematical object (i.e., MOI) must contain at least one identifier.

Hence, we only study subtrees that contain at least one <mi> node. Identifiers, in the sense of MathML, are '*symbolic names or arbitrary text*'<sup>26</sup>, e.g., single Latin or Greek letters. Identifiers do not contain special characters (other than Greek letters) or numbers. As a consequence, arithmetic expressions, such as $(1+2)^2$, or sequences of special characters and numbers, such as $\{1, 2, \ldots\} \cap \{-1\}$, will not appear in our distribution analysis. However, if a sequence or arithmetic expression contains an identifier somewhere in its pMML tree (such as $\{1, 2, \ldots\} \cap A$), the entire expression will be recognized. The Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ therefore consists of the following subexpressions: $P_n^{(\alpha,\beta)}$, $(\alpha,\beta)$, $(x)$, and the single identifiers $P$, $n$, $\alpha$, $\beta$, and $x$. The entire expression is also a mathematical object. Hence, we take entire expressions with an identifier into

account for our analysis. In the following, the set of subexpressions will be understood to include the expression itself.

For our experiments, we also generated a string representation of the MathML data. The string is generated recursively by applying one of two rules for each node: (i) if the current node is a leaf, the node tag and its content are joined by a colon, e.g., <mi>x</mi> will be converted

<sup>24</sup>The arXiv categories *astro-ph* (astrophysics), *cond-mat* (condensed matter), and *math* (mathematics) were still too large for a single database. Thus, we split those categories into two equally sized parts.

<sup>25</sup>Sequences are always nested in an <mrow> element.

<sup>26</sup>https://www.w3.org/TR/MathML3/chapter3.html [accessed 2019-09-01]

to mi:x; (ii) otherwise, the node tag wraps parentheses around its content and separates the children by commas, e.g.,

$$\texttt{<mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow>}\tag{3.3}$$

will be converted to

$$\texttt{mrow(mo:(,mi:x,mo:))}.\tag{3.4}$$

Furthermore, the special UTF-8 characters for invisible times (U+2062) and function application (U+2061) are replaced by ivt and fa, respectively. For example, the gamma function with argument $x+1$, i.e., $\Gamma(x+1)$, would be represented by

$$\texttt{mrow(mi:Γ,mo:ivt,mrow(mo:(,mrow(mi:x,mo:+,mn:1),mo:)))}.\tag{3.5}$$

Between $\Gamma$ and $(x+1)$, there would most likely be the special character for *invisible times* rather than for *function application*, because LaTeXML is not able to parse $\Gamma$ as a function. Note that this string conversion is a bijective mapping. The string representation reduces the verbose XML format to a more concise presentation. Thus, an equivalence check between two expressions is more efficient.
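The two serialization rules above can be sketched in a few lines of Python. This is a toy model of the pMML tree for illustration, not the implementation used in this thesis; the tuple-based node format is an assumption.

```python
# Sketch of the string serialization: rule (i) joins a leaf's tag and content
# with a colon; rule (ii) wraps the serialized children of an inner node in
# parentheses. A node is ("tag", [children]) or ("tag", "text") for leaves.

INVISIBLE_TIMES = "\u2062"       # U+2062, replaced by "ivt"
FUNCTION_APPLICATION = "\u2061"  # U+2061, replaced by "fa"

def to_string(node):
    """Convert a toy pMML tree into the compact string representation."""
    tag, content = node
    if isinstance(content, str):  # rule (i): leaf -> tag:content
        content = content.replace(INVISIBLE_TIMES, "ivt")
        content = content.replace(FUNCTION_APPLICATION, "fa")
        return f"{tag}:{content}"
    # rule (ii): inner node -> tag(child,child,...)
    return f"{tag}({','.join(to_string(c) for c in content)})"

# <mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow>  ->  mrow(mo:(,mi:x,mo:))
expr = ("mrow", [("mo", "("), ("mi", "x"), ("mo", ")")])
print(to_string(expr))  # mrow(mo:(,mi:x,mo:))
```

Because the mapping is bijective, the string doubles as a canonical key: two subexpressions are structurally identical exactly when their strings are equal.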

### **3.2.2.2 Complexity of Math**

Mathematical expressions can become complex and lengthy. The tree structure of MathML allows us to introduce a measure that reflects the complexity of mathematical expressions. More complex expressions usually consist of more extensively nested subtrees in the MathML data. Thus, we define the complexity of a mathematical expression as the maximum depth of its MathML tree. In XML, the content of a node and its attributes are commonly interpreted as children of the node. Thus, we define the depth of a single node as 1 rather than 0, i.e., single identifiers, such as <mi>P</mi>, have a complexity of 1. The Jacobi polynomial from Listing 3.1 has a complexity of 4.
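Under the same toy tree model (an illustrative assumption, not the thesis implementation), the complexity measure is a straightforward recursive maximum depth:

```python
# Complexity = maximum depth of the pMML tree; a single node counts as depth 1.
# Nodes use the illustrative format ("tag", children-or-text).

def complexity(node):
    tag, content = node
    if isinstance(content, str):  # leaf, e.g. <mi>P</mi>
        return 1
    return 1 + max(complexity(c) for c in content)

print(complexity(("mi", "P")))                                        # 1
print(complexity(("mrow", [("mo", "("), ("mi", "x"), ("mo", ")")])))  # 2
```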

We perform the extraction of subexpressions from MathML in BaseX. The algorithm for the extraction process is written in XQuery. The algorithm traverses recursively downwards from the root to the leaves. In each iteration, it checks whether there is an identifier, i.e., an <mi> element, among the descendants of the current node. If there is no such element, the subtree will be ignored. It may seem counterintuitive to start from the root and check whether an identifier is among the descendants, rather than starting at each identifier and traversing upwards to the root. However, if an XQuery requests a node in BaseX, BaseX loads the entire subtree of the requested node into the cache (up to a specified size). If the algorithm traverses upwards through the MathML tree, the XQuery triggers database requests in every iteration. Hence, the downwards implementation performs better, since there is only one database request for every expression rather than for every subexpression.
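The pruning logic of this downward traversal can be sketched in Python (the actual implementation is XQuery on BaseX; this is a simplified stand-in on the toy tuple-based tree format):

```python
# Traverse downwards from the root and keep every subtree that contains at
# least one <mi> descendant; identifier-free subtrees are pruned entirely.

def contains_identifier(node):
    tag, content = node
    if tag == "mi":
        return True
    if isinstance(content, str):
        return False
    return any(contains_identifier(c) for c in content)

def extract_mois(node, found=None):
    """Collect all subtrees (incl. the root) that contain an identifier."""
    if found is None:
        found = []
    if not contains_identifier(node):
        return found  # prune: ignore identifier-free subtrees
    found.append(node)
    tag, content = node
    if not isinstance(content, str):
        for child in content:
            extract_mois(child, found)
    return found

expr = ("mrow", [("mo", "("), ("mi", "x"), ("mo", ")")])
subtrees = extract_mois(expr)
# the full (x) expression and the identifier x survive; the parentheses are pruned
print(len(subtrees))  # 2
```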

Since we only minimize the pMML data rather than normalizing it, two identically rendered expressions may have different complexities. For instance,

$$\texttt{<mrow><mi>x</mi></mrow>}\tag{3.6}$$

consists of two distinct subexpressions, but both of them are displayed the same. Another problem often appears for arrays or similar visually complicated structures. The extracted expressions are not necessarily logical subexpressions. We will consider applying more advanced embedding techniques such as special tokenizers [231], symbol layout trees [92, 407], and a MathML normalization via MathMLCan [117] in future research to overcome these issues.


### **3.2.3 Frequency Distributions of Mathematical Formulae**

By splitting each formula into subexpressions, we generated longer documents and a bias towards low complexities. Note that, hereafter, we only refer to the mathematical content of documents. Thus, the length of a document refers to the number of math formulae (here, the number of subexpressions) in the document. After splitting expressions into subexpressions, arXiv consists of 2.5B and zbMATH of 61M expressions, which raised the average document length to 2,982.87 for arXiv and 45.47 for zbMATH, respectively.

For calculating frequency distributions, we merged two subexpressions if their string representations were identical. Remember, the string representation is unique for each MathML tree. After merging, arXiv consisted of 350,206,974 unique mathematical subexpressions with a maximum complexity of 218 and an average complexity of 5.01. For complexities above 70, the formulae show some erroneous structures that might have been generated by LaTeXML by mistake. For example, the expression with the highest complexity is a long sequence of a polynomial starting with '$P_4(t_1, t_3, t_7, t_{11}) =$' followed by 690 summands. The complexity is caused by a high number of unnecessarily deeply nested <mrow> nodes. The highest complexity with a minimum document frequency of two is 39, which is a continued fraction. Since continued fractions are nested fractions, they naturally have a large complexity. One of the most complex expressions (complexity 20) with a minimum document frequency of three was the formula
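Because the string representation is unique per MathML tree, the merging step reduces to counting identical strings. A minimal sketch with hypothetical data:

```python
# Merging subexpressions: counting unique MathML trees reduces to counting
# their (bijective) string representations. The stream below is made up.
from collections import Counter

stream = ["mi:x", "mrow(mo:(,mi:x,mo:))", "mi:x", "mi:t", "mrow(mo:(,mi:x,mo:))"]

freq = Counter(stream)  # one counter entry per unique subexpression
print(len(freq))        # 3 unique subexpressions
print(freq["mi:x"])     # term frequency of the identifier x: 2
```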

$$\left(\sum\_{j\_1=1}^n \left(\sum\_{j\_2=1}^n \left(\dots\left(\sum\_{j\_m=1}^n \left| T\left(e\_{j\_1},\dots,e\_{j\_m}\right)\right|^{q\_m}\right)^{\frac{q\_{m-1}}{q\_m}}\dots\right)^{\frac{q\_2}{q\_3}}\right)^{\frac{q\_1}{q\_2}}\right)^{\frac{1}{q\_1}} \le C\_{m,p,\mathbf{q}}^{\mathbb{K}} \|T\|.\tag{3.7}$$

In contrast, zbMATH only consisted of 8,450,496 unique expressions with a maximum complexity of 26 and an average complexity of 3.89. One of the most complex expressions in zbMATH with a minimum document frequency of three was

$$M\_p(r,f) = \left(\frac{1}{2\pi} \int\_0^{2\pi} \left| f\left(re^{i\theta}\right) \right|^p d\theta \right)^{1/p} . \tag{3.8}$$

As we expected, reviews and abstracts in zbMATH were generally shorter and consisted of less complex mathematical formulae. The dataset also appeared to contain fewer erroneous expressions, since expressions of complexity 25 are still readable and meaningful.

Figure 3.2 shows the ratio of unique subexpressions for each complexity in both datasets. The figure illustrates that both datasets share a peak at complexity four. Compared to zbMATH, the arXiv expressions are slightly more evenly distributed over the different levels of complexity. Interestingly, complexities one and two are not dominant in either of the two datasets. Single identifiers only make up 0.03% in arXiv and 0.12% in zbMATH, which is comparable to expressions of complexity 19 and 14, respectively. This finding illustrates the problem of capturing semantic meanings for single identifiers rather than for more complex expressions [330]. It also substantiates that entire expressions, if too complex, are not suitable either for capturing semantic meanings [214]. Instead, a middle ground is desirable, since most unique expressions in both datasets have a complexity between 3 and 5. Table 3.4 summarizes the statistics of the examined datasets.

### **Chapter 3: Semantification of Mathematical LaTeX**

Figure 3.2: Unique subexpressions for each complexity in arXiv and zbMATH.

Table 3.4: Dataset overview. Average document length is defined as the average number of subexpressions per document.


### **3.2.3.1 Zipf's Law**

In linguistics, it is well known that word distributions follow Zipf's law [297], i.e., the *r*-th most frequent word has a frequency that scales as

$$f(r) \propto \frac{1}{r^{\alpha}}\tag{3.9}$$

with *α* ≈ 1. A better approximation is given by a shifted distribution

$$f(r) \propto \frac{1}{(r+\beta)^{\alpha}},\tag{3.10}$$

where *α* ≈ 1 and *β* ≈ 2.7. In a study on Zipf's law, Piantadosi [297] illustrated that not only words in natural language corpora follow this law surprisingly accurately, but also many other human-created sets, for instance, in programming languages, in biological systems, and even in music. Since mathematical communication is the result of centuries of research, it would not be surprising if mathematical notation also followed Zipf's law. The primary conclusion of the law is that there are a few very common tokens against a large number of symbols which are not used frequently. Based on this assumption, we can postulate that a score based on frequencies might be able to measure the peculiarity of a token. The well-known TF-IDF ranking functions and their derivatives [23, 310] have performed well in linguistics for


many years and are still widely used in retrieval systems [30]. However, since we split every expression into its subexpressions, we generated an anomalous bias towards shorter, i.e., less complex, formulae. Hence, distributions of subexpressions may not obey Zipf's law.
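As a small numerical illustration of the shifted distribution (3.10), the following sketch uses synthetic rank-frequency data with the values *α* = 1.3 and *β* = 15.82 that both corpora turn out to follow, and verifies that the distribution is a straight line with slope −*α* on axes of log *f* against log(*r* + *β*):

```python
# Synthetic illustration of the shifted Zipf distribution f(r) ∝ 1/(r+β)^α:
# log f(r) = -α·log(r+β), so pairwise slopes on these log axes equal -α.
import math

alpha, beta = 1.3, 15.82

def shifted_zipf(rank):
    return 1.0 / (rank + beta) ** alpha

ranks = range(1, 10001)
freqs = [shifted_zipf(r) for r in ranks]

slopes = [
    (math.log(freqs[i]) - math.log(freqs[0]))
    / (math.log(ranks[i] + beta) - math.log(ranks[0] + beta))
    for i in (100, 1000, 9999)
]
print([round(s, 3) for s in slopes])  # [-1.3, -1.3, -1.3]
```

With empirical counts instead of synthetic ones, fitting this slope (e.g., by least squares on the log axes) is the usual way to estimate *α*.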

Figure 3.3: Each figure illustrates the relationship between the frequency ranks (*x*-axis) and the normalized frequency (*y*-axis) in zbMATH (top) and arXiv (bottom). For arXiv, only the first 8 million entries are plotted to be comparable with zbMATH (≈ 8.5 million entries). Subfigure (a) shades the hexagonal bins from green to yellow using a logarithmic scale according to the number of math expressions that fall into a bin. The dashed orange line represents Zipf's distribution (3.10). The values for *α* and *β* are provided in the plots. Subfigure (b) shades the bins from blue to red according to the maximum complexity in each bin.

Figure 3.3 visualizes a comparison between Zipf's law and the frequency distributions of mathematical subexpressions in arXiv and zbMATH. The dashed orange line visualizes the power law (3.10). The plots demonstrate that the distributions in both datasets obey this power law. Interestingly, there is not much difference between the distributions of the two datasets. Both seem to follow the same power law, with *α* = 1.3 and *β* = 15.82. Moreover, we can observe that the developed complexity measure seems to be appropriate, since the complexity distributions for formulae are similar to the distributions for the length of words [297]. In other words, more complex formulae, like long words in natural languages, are generally more specialized and thus appear less frequently throughout the corpus. Note that the colors of the bins for complexities fluctuate for rare expressions because the color represents the maximum rather than the average complexity in each bin.

### **3.2.3.2 Analyzing and Comparing Frequencies**

Figure 3.4 shows in detail the most frequently used mathematical expressions in arXiv for the complexities 1 to 7. The orange dashed line visible in all graphs represents the normal Zipf's law distribution from Equation (3.9). We explore the total frequency values without any normalization. Thus, Equation (3.9) was multiplied by the highest frequency for each complexity level to fit the distribution. The plots in Figure 3.4 demonstrate that even though the parameter *α* varies between 0.35 and 0.62, the distributions in each complexity class also obey Zipf's law.

The plots for each complexity class contain some interesting fluctuations. We can spot a set of five single identifiers that are most frequently used throughout arXiv: *n*, *i*, *x*, *t*, and *k*. Even though the distributions follow Zipf's law accurately, we can observe that these five identifiers are proportionally more frequently used than other identifiers and clearly separate themselves from the rest (notice the large gap from *k* to *a*). All five identifiers are known to be used in a large variety of scenarios. One might expect that common pairs of identifiers would share comparable frequencies in the plots. Surprisingly, typical pairs, such as *x* and *y*, or *α* and *β*, show a large discrepancy.

The plot of complexity two also reveals that two expressions are used proportionally more often than others: (*x*) and (*t*). These two expressions appear more than three times as often in the corpus as any other expression of the same complexity. On the other hand, the quantitative difference between (*x*) and (*t*) is negligible. We may assume that arXiv's primary domain, physics, causes the quantitative disparity between (*x*), (*t*), and the other tokens. The primary domain of the dataset becomes more clearly visible for higher complexities, such as *SU*(2) (C3)<sup>27</sup> or km s<sup>−1</sup> (C4).

Another surprising property of arXiv is that symmetry groups, such as *SU*(2), appear to play an essential role in the majority of articles on arXiv, see *SU*(2) (C3), *SU*(2)<sub>*L*</sub> (C4), and *SU*(2) × *SU*(2) (C5), among others. The plots of higher complexities<sup>28</sup> make this even more noticeable. Given a complexity of six, for example, the most frequently used expression was *SU*(2)<sub>*L*</sub> × *SU*(2)<sub>*R*</sub>, and for a complexity of seven it was *SU*(3) × *SU*(2) × *U*(1). Given a complexity of eight, ten out of the top-12 expressions were from symmetry group calculations.

It is also worthwhile to compare expressions among different levels of complexity. For instance, (*x*) and (*t*) appeared almost six million times in the corpus, but *f*(*x*) (at position three in C3) was the only expression which contained one of these most common expressions. Note that subexpressions of variations, such as (*x*<sub>0</sub>), (*t*<sub>0</sub>), or (*t* − *t*′), do not match the expression of complexity two. This may imply that (*x*), and especially (*t*), appear in many different scenarios. Further, we can see that even though (*x*) is a part of *f*(*x*) in only approximately 3% of all cases, it is still the most likely combination. These results are especially useful for recommendation systems that use math as input. Moreover, plagiarism detection


<sup>27</sup>We refer to a given complexity *n* with C*n*, i.e., C3 refers to complexity 3.

<sup>28</sup>More plots showing higher complexities are available at https://github.com/ag-gipp/FormulaCloudData [accessed 2021-10-01]

Figure 3.4: Overview of the most frequent mathematical expressions in arXiv for complexities 1-7. The color gradient from yellow to blue represents the frequency in the dataset. Zipf's law (3.9) is represented by a dashed orange line.

systems may also benefit from such a knowledge base. For instance, it might be evident that *f*(*x*) is a very common expression, but for automatic systems that work on a large scale, it is not clear whether duplicate occurrences of *f*(*x*) or Ξ(*x*) should be scored differently, e.g., in the case of plagiarism detection.

Figure 3.4 shows only the most frequently occurring expressions in arXiv. Since we already observed a bias towards physics formulae in arXiv, it is worth comparing the expressions present within both datasets. Figure 3.5 compares the top-25 expressions for the complexities one to six. In zbMATH, we discovered that computer science and graph theory appear as popular topics, see for example *G* = (*V, E*) (in C3 at position 20) and the Bachmann–Landau notations in *O*(log *n*), *O*(*n*<sup>2</sup>), and *O*(*n*<sup>3</sup>) (C4 positions 2, 3, and 19).

From Figure 3.5, we can also deduce useful information for MathIR tasks which focus on semantic information. Current semantic extraction tools [330] or LaTeX parsers [18] still have difficulties distinguishing *multiplications* from *function calls*. For example, as mentioned before, LaTeXML [257] adds an *invisible times* character between *f* and (*x*) rather than a *function application*. Investigating the most frequently used terms in zbMATH in Figure 3.5 reveals that *u* is most likely considered to be a function in the dataset: *u*(*t*) (rank 8), *u*(*x*) (rank 13), *u*<sub>*xx*</sub> (rank 16), *u*(0) (rank 17), |∇*u*| (rank 22). Manual investigation of extended lists reveals even more hits: *u*<sub>0</sub>(*x*) (rank 30), −Δ*u* (rank 32), and *u*(*x, t*) (rank 33). Since all eight terms are among the most frequent 35 entries in zbMATH, *u* can most likely be considered to imply a function in zbMATH. Of course, this does not mean that *u* must always be a function in zbMATH (see *f*(*u*) at rank 14 in C3), but it allows us to exploit probabilities for improving MathIR performance. For instance, if not stated otherwise, *u* could be interpreted as a function by default, which could help increase the precision of the aforementioned tools.

Figure 3.5 also demonstrates that our two datasets diverge for increasing complexities. Hence, we can assume that frequencies of less complex formulae are more topic-independent. Conversely, the more complex a math formula is, the more context-specific it is. In the following, we will further investigate this assumption by applying TF-IDF rankings to the distributions.

### **3.2.4 Relevance Ranking for Formulae**

Zipf's law encourages the idea of scoring the relevance of words according to their number of occurrences in the corpus and in the documents. The family of BM25 ranking functions based on TF-IDF scores is still widely used in several retrieval systems [30, 310]. Since we demonstrated that mathematical formulae (and their subexpressions) obey Zipf's law in large scientific corpora, it appears intuitive to also use TF-IDF rankings, such as a variant of BM25, to calculate their relevance.

### **Okapi BM25**

In its original form [310], *Okapi BM25* was calculated as follows

$$\text{bm25}(t, d) := \frac{(k+1)\,\text{IDF}(t)\,\text{TF}(t, d)}{\text{TF}(t, d) + k\left(1 - b + \frac{b|d|}{\text{AVG}\_{\text{DL}}}\right)}.\tag{3.11}$$

Figure 3.5: The top-20 and 25 most frequent expressions in arXiv (left) and zbMATH (right) for complexities 1-6. A line between both sets indicates a matching set. Bold lines indicate that the matches share a similar rank (distance of 0 or 1).

Here, TF(*t, d*) is the term frequency of *t* in the document *d*, |*d*| the length of the document *d* (in our case, the number of subexpressions), AVG<sub>DL</sub> the average length of the documents in the corpus (see Table 3.4), and IDF(*t*) the inverse document frequency of *t*, defined as

$$\text{IDF}(t) := \log \frac{N - n(t) + \frac{1}{2}}{n(t) + \frac{1}{2}},\tag{3.12}$$

where *N* is the number of documents in the corpus and *n*(*t*) the number of documents which contain the term *t*. By adding 1/2, we avoid log 0 and division by 0. The parameters *k* and *b* are free, with *b* controlling the influence of the normalized document length and *k* controlling the influence of the term frequency on the final score. For our experiments, we chose the standard value *k* = 1.2 and a high impact of the normalized document length via *b* = 0.95.
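Equations (3.11) and (3.12) translate directly into code. The following is a minimal sketch on a hypothetical toy corpus (each document is a list of subexpression strings), not the distributed implementation used in the study:

```python
# Direct transcription of Okapi BM25, Eqs. (3.11) and (3.12); toy corpus only.
import math

def idf(term, docs):
    """Eq. (3.12): inverse document frequency with the 1/2 smoothing."""
    n_t = sum(term in d for d in docs)
    return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5))

def tf(term, doc):
    """Term frequency of `term` in one document."""
    return doc.count(term)

def bm25(term, doc, docs, k=1.2, b=0.95):
    """Eq. (3.11) with the parameter choice used in the chapter."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    num = (k + 1) * idf(term, docs) * tf(term, doc)
    den = tf(term, doc) + k * (1 - b + b * len(doc) / avgdl)
    return num / den

docs = [
    ["mi:x", "mrow(mo:(,mi:x,mo:))"],
    ["mi:x", "mi:y"],
    ["mi:z"],
]
print(round(bm25("mrow(mo:(,mi:x,mo:))", docs[0], docs), 3))  # 0.463
```

Note that terms appearing in most documents receive a negative IDF and hence a negative score, which is one motivation for the modifications introduced next.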

As a result of our subexpression extraction algorithm, we generated a bias towards low complexities. Moreover, longer documents generally consist of more complex expressions. As demonstrated in Section 3.2.2.1, a document that only consists of the single expression $P_n^{(\alpha,\beta)}(x)$, i.e., a document of length one, would generate eight subexpressions, i.e., it results in a document length of eight. Thus, we modify the BM25 score in Equation (3.11) to emphasize higher complexities and longer documents. First, the average document length is divided by the average complexity AVG<sub>*C*</sub> of the corpus in use (see Table 3.4), and we calculate the reciprocal of the document length normalization to emphasize longer documents.

Moreover, within the scope of a single document, we want to emphasize expressions that do not appear frequently in this document but are the most frequent among their level of complexity. Thus, less complex expressions are ranked more highly if the document overall is not very complex. To achieve this weighting, we normalize the term frequency of an expression *t* according to its complexity *c*(*t*) and introduce an inverse term frequency over all expressions in the document. We define the inverse term frequency as

$$\text{ITF}(t,d) := \log \frac{|d| - \text{TF}(t,d) + \frac{1}{2}}{\text{TF}(t,d) + \frac{1}{2}}. \tag{3.13}$$

### **Definition of the importance score of a formula in a document**

Finally, we define the score s(*t, d*) of a term *t* in a document *d* as

$$s(t,d) := \frac{(k+1)\operatorname{IDF}(t)\operatorname{ITF}(t,d)\operatorname{TF}(t,d)}{\max\_{t' \in d|\_{c(t)}} \operatorname{TF}(t',d) + k\left(1 - b + \frac{b\operatorname{AVG}\_{\text{DL}}}{|d|\operatorname{AVG}\_{C}}\right)}.\tag{3.14}$$

The TF-IDF ranking functions and the introduced s(*t, d*) are used to retrieve relevant documents for a given search query. However, we want to retrieve relevant subexpressions over a set of documents.

### **Definition of the Mathematical BM25**

Thus, we define the score of a formula (mBM25) over a set of documents as the maximum score over all documents

$$\text{mBM25}(t, D) := \max\_{d \in D} \text{s}(t, d), \tag{3.15}$$

where *D* is a set of documents.
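The modified scores (3.13)–(3.15) can be sketched compactly, again on a hypothetical toy corpus. Here, documents are term-frequency dictionaries, `complexity` is an assumed map from expression to complexity class, and the corpus averages AVG<sub>DL</sub> and AVG<sub>*C*</sub> are computed from the toy data rather than taken from Table 3.4:

```python
# Sketch of s(t,d) (Eq. 3.14) with ITF (Eq. 3.13) and mBM25 (Eq. 3.15).
import math

def doc_len(d):
    """|d|: number of subexpressions in the document."""
    return sum(d.values())

def s_score(t, d, docs, complexity, k=1.2, b=0.95):
    N = len(docs)
    n_t = sum(t in doc for doc in docs)
    idf = math.log((N - n_t + 0.5) / (n_t + 0.5))         # Eq. (3.12)
    tf = d.get(t, 0)
    itf = math.log((doc_len(d) - tf + 0.5) / (tf + 0.5))  # Eq. (3.13)
    avg_dl = sum(doc_len(doc) for doc in docs) / N
    avg_c = sum(complexity.values()) / len(complexity)    # corpus avg complexity
    # maximum TF among the document's expressions of the same complexity class
    max_tf = max((f for x, f in d.items() if complexity[x] == complexity[t]),
                 default=0)
    num = (k + 1) * idf * itf * tf
    den = max_tf + k * (1 - b + b * avg_dl / (doc_len(d) * avg_c))
    return num / den                                      # Eq. (3.14)

def mbm25(t, docs, complexity):
    return max(s_score(t, d, docs, complexity) for d in docs)  # Eq. (3.15)

docs = [
    {"mi:x": 4, "zeta(s)": 1},  # the identifier x dominates this document
    {"mi:y": 2},
    {"mi:z": 1},
]
complexity = {"mi:x": 1, "mi:y": 1, "mi:z": 1, "zeta(s)": 3}

# the rare, more complex expression outranks the locally dominant identifier
print(mbm25("zeta(s)", docs, complexity) > mbm25("mi:x", docs, complexity))  # True
```

The toy example reproduces the intended behavior: an expression that is rare in the document but top of its complexity class scores higher than an identifier that dominates its document and is therefore penalized by the ITF term.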

We used *Apache Flink* [157] to count the expressions and process the calculations. Thus, our implemented system scales well for large corpora.

Table 3.6 shows the top-7 scored expressions, where *D* is the entire zbMATH dataset. The retrieved expressions can be considered meaningful, real-world examples of MOIs, since most expressions are known for specific mathematical concepts, such as Gal($\overline{\mathbb{Q}}/\mathbb{Q}$), which refers to the Galois group of $\overline{\mathbb{Q}}$ over $\mathbb{Q}$, or $L^2(\mathbb{R}^2)$, which refers to the $L^2$-space (also known as *Lebesgue space*) over $\mathbb{R}^2$. However, a more topic-specific retrieval algorithm is desirable. To achieve this goal, we (i)

Table 3.5: Settings for the retrieval experiments.


retrieved a topic-specific subset of documents *D<sub>q</sub>* ⊂ *D* for a given textual search query *q*, and (ii) calculated the scores of all expressions in the retrieved documents. To generate *D<sub>q</sub>*, we indexed the text sources of the documents from arXiv and zbMATH via Elasticsearch (ES)<sup>29</sup> and performed the pre-processing steps of filtering stop words, stemming, and ASCII-folding<sup>30</sup>. Table 3.5 summarizes the settings we used to retrieve MOIs from a topic-specific subset of documents *D<sub>q</sub>*. We also set a minimum hit frequency according to the number of retrieved documents an expression appears in. This requirement filters out uncommon notations.

Figure 3.6 shows the results for five search queries. We asked a domain expert from NIST to annotate the results as related (shown as green dots in Figure 3.6) or non-related (red dots). We found that the results range from good performance (e.g., for the Riemann zeta function) to bad performance (e.g., the beta function). For instance, the results for the Riemann zeta function are surprisingly accurate, since we could discover that parts of Riemann's hypothesis<sup>31</sup> were ranked highly throughout the results (e.g., $\zeta(\frac{1}{2} + it)$). On the other hand, for the beta function, we retrieved only a few related hits, of which only one had a strong connection to the beta function *B*(*x, y*). We observed that the results were quite sensitive to the chosen settings (see Table 3.5). For instance, for the beta function, the minimum hit frequency has a strong effect on the results, since many expressions are shared among multiple documents. For arXiv, the expressions *B*(*α, β*) and *B*(*x, y*) only appear in one of the 40 retrieved documents. However, decreasing the minimum hit frequency would increase noise in the results.

<sup>29</sup>https://github.com/elastic/elasticsearch [accessed 2019-09-01]. We used version 7.0.0

<sup>30</sup>This means that non-ASCII characters are replaced by their ASCII counterparts or will be ignored if no such counterpart exists.

<sup>31</sup>Riemann proposed that the real part of every non-trivial zero of the Riemann zeta function is 1/2. If this hypothesis is correct, all non-trivial zeros lie on the critical line consisting of the complex numbers 1/2 + *it*.

Figure 3.6: Top-20 ranked expressions retrieved from a topic-specific subset of documents *D<sub>q</sub>*. The search query *q* is given above the plots. Retrieved formulae are annotated by a domain expert with green dots for relevant and red dots for non-relevant hits. A line is drawn if a hit appears in both result sets. The line is colored green when the hit was marked as relevant.




Even though we asked a domain expert to annotate the results as relevant or not, there is still plenty of room for discussion. For instance, (*x*+*y*) (rank 15 in zbMATH, 'Beta Function') is the argument of the gamma function Γ(*x* + *y*) that appears in the definition of the beta function [98, (5.12.1)]: *B*(*x, y*) := Γ(*x*)Γ(*y*)*/*Γ(*x* + *y*). However, this relation is weak at best, and thus might be considered as not related. Other examples are Re *z* and Re(*s*), which play a crucial role in the scenario of the Riemann hypothesis (all non-trivial zeros have Re(*s*) = 1/2). Again, this connection is not obvious, and these expressions are often used in multiple scenarios. Thus, the domain expert did not mark these expressions as related.

Considering the differences in the documents, it is promising to have observed a relatively high number of shared hits in the results. Further, we were able to retrieve some surprisingly good insights from the results, such as extracting the full definition of the Riemann zeta function [98, (25.2.1)]: $\zeta(s) := \sum_{n=1}^{\infty} \frac{1}{n^s}$. Even though the high number of shared hits seems to substantiate the reliability of the system, several aspects affected the outcome negatively, from the exact definition of the search queries used to retrieve documents via ES, to the number of retrieved documents, the minimum hit frequency, and the parameters in mBM25.

### **3.2.5 Applications**

The presented results are beneficial for a variety of use cases. In the following, we will demonstrate and discuss several of the applications that we propose.

**Extension of zbMATH's Search Engine** Formula search engines are often counterintuitive compared to textual search, since the user must know how the system operates to enter a search query properly (e.g., does the system support LaTeX inputs?). Additionally, mathematical concepts can be difficult to capture using only mathematical expressions. Consider, for example, someone who wants to search for mathematical expressions related to eigenvalues. A textual search query would only retrieve entire documents that require further investigation to find related expressions. A mathematical search engine, on the other hand, is impractical, since it is not clear what a fitting search query would be (e.g., *Av* = *λv*?). Moreover, formula and textual search systems for scientific corpora are separated from each other. Thus, a textual search engine capable of retrieving mathematical formulae can be beneficial. Also, many search engines allow narrowing down relevant hits by suggesting filters based on the retrieved results. This technique is known as faceted search. The zbMATH search engine also provides faceted search, e.g., by author or year. Adding facets for mathematical expressions allows users to narrow down the results more precisely to arrive at specific documents.

Our proposed system for extracting relevant expressions from scientific corpora via mBM25 scores can be used to search for formulae even with textual search queries, and to add more filters for faceted search implementations. Table 3.7 shows two examples of such an extension for zbMATH's search engine. Searching for 'Riemann Zeta Function' and 'Eigenvalue' retrieved 4,739 and 25,248 documents from zbMATH, respectively. Table 3.7 shows the most frequently used mathematical expressions in the set of retrieved documents. It also shows the formulae reordered according to a default TF-IDF score (with normalized term frequencies) and our proposed mBM25 score. The results can be used to add filters for faceted search, e.g., to show only the documents which contain $u \in W_0^{1,p}(\Omega)$. Additionally, the search system now provides more intuitive textual inputs even for retrieving mathematical formulae. The retrieved formulae are also interesting in themselves, since they provide insightful information on the retrieved publications. As already observed with our custom document search system in Figure 3.6, the Riemann hypothesis is also prominent in these retrieved documents.

The differences between the TF-IDF and mBM25 rankings illustrate the problem of an extensive evaluation of our system. From a broader perspective, the hit *Ax* = *λBx* is highly correlated with the input query 'Eigenvalue'. On the other hand, the raw frequencies revealed a prominent role of $\operatorname{div}(|\nabla u|^{p-2}\nabla u)$. Therefore, the top results of the mBM25 ranking can also be considered relevant.

**Math Notation Analysis** A faceted search system allows us to analyze mathematical notations in more detail. For instance, we can retrieve documents from a specific time period. This allows one to study the evolution of mathematical notation over time [54], or to identify trends in specific fields. Also, we can analyze standard notations for specific authors, since it is often assumed that authors prefer a specific notation style which may vary from the standard notation in a field.

Table 3.7: The top-5 frequent mathematical expressions in the result set of zbMATH for the search queries 'Riemann Zeta Function' (top) and 'Eigenvalue' (bottom) grouped by their complexities (left) and the hits reordered according to their relevance scores (right). The TF-IDF score was calculated with normalized term frequencies.



Table 3.8: Suggestions to complete '*E* = *m*' and '*E* = {*m, c*}' (the right-hand side contains *m* and *c*) with term and document frequency based on the distributions of formulae in arXiv.

**Math Recommendation Systems** The frequency distributions of formulae can be used to realize effective math recommendation tasks, such as type hinting or error correction. Machine learning approaches for these tasks require long training on large datasets, but may still generate meaningless results, such as $G_i = \{(x, y) \in \mathbb{R}^n : x_i = x_i\}$ [400]. We propose a simpler system which takes advantage of our frequency distributions. We retrieve entries from our result database, which contains all unique expressions and their frequencies. We implemented a simple prototype that retrieves the entries via pattern matching. Table 3.8 shows two examples. The left side of the table shows suggested autocompleted expressions for the query '*E* = *m*'. The right side shows suggestions for '*E* =', where the right-hand side of the equation should contain *m* and *c* in any order. A combination with more advanced retrieval techniques, such as similarity measures based on symbol layout trees [92, 407], would enlarge the number of suggestions. This kind of autocomplete and error-correcting type-hinting system would be beneficial for various use cases, e.g., in educational software or for search engines as a pre-processing step of the input.
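The two query modes from Table 3.8 can be sketched with plain pattern matching over a frequency table. The mini database below is hypothetical; real counts would come from the result database of unique expressions:

```python
# Frequency-based suggestion sketch: prefix completion and "RHS must contain
# these symbols" matching over a (made-up) expression-frequency table.

FREQ = {
    "E=mc^2": 1200,
    "E=m": 150,
    "E=mc": 90,
    "E=hf": 300,
    "E=E_0": 40,
}

def suggest(prefix, top=3):
    """Rank all known expressions starting with `prefix` by corpus frequency."""
    hits = [(expr, f) for expr, f in FREQ.items() if expr.startswith(prefix)]
    return sorted(hits, key=lambda p: -p[1])[:top]

def suggest_containing(lhs, symbols, top=3):
    """Suggestions for `lhs` whose right-hand side contains all `symbols`."""
    hits = [
        (expr, f) for expr, f in FREQ.items()
        if expr.startswith(lhs) and all(s in expr[len(lhs):] for s in symbols)
    ]
    return sorted(hits, key=lambda p: -p[1])[:top]

print(suggest("E=m"))                         # completions of 'E = m'
print(suggest_containing("E=", {"m", "c"}))   # RHS must contain m and c
```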

**Plagiarism Detection Systems** As previously mentioned, plagiarism detection systems would benefit from a system capable of distinguishing conventional from uncommon notations [253, 254, 334]. The approaches described by Meuschke et al. [254] outperform existing approaches by considering frequency distributions of single identifiers (expressions of complexity one). Considering that single identifiers make up only 0*.*03% of all unique expressions in arXiv, we presume that better performance can be achieved by considering more complex expressions. The conferred string representation also provides a simple format to embed complex expressions in existing learning algorithms.

Expressions with high complexities that are shared among multiple documents may provide further hints for investigating potential plagiarism. For instance, the most complex expression that was shared among three documents in arXiv was Equation (3.7). A complex expression being identical in multiple documents could indicate a higher likelihood of plagiarism. Further investigation revealed that similar expressions, e.g., with infinite sums, are frequently used in a larger set of documents. Thus, the expression seems to be part of a standard notation that is commonly shared, rather than a good candidate for plagiarism detection. Through manual investigation, we identified the equation as part of a concept called the *generalized Hardy-Littlewood inequality*; Equation (3.7) appears in the three documents [24, 292, 304]. All


Figure 3.7: The top ranked expression for '*Jacobi polynomial*' in arXiv and zbMATH. For arXiv, 30 documents were retrieved with a minimum hit frequency of 7.

three documents shared one author in common. Thus, this case also demonstrates a correlation between complex mathematical notation and authorship.
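The heuristic discussed above can be sketched as follows. The function, thresholds, and the complexity measure are illustrative stand-ins (complexity here is a placeholder for the expression-tree size used in this chapter): complex expressions shared by only a few documents are flagged, while widely shared expressions are treated as standard notation, mirroring the Hardy-Littlewood example.

```python
from collections import defaultdict

def plagiarism_candidates(doc_expressions, min_complexity=10, max_share=3):
    """Flag complex expressions shared by few documents.

    `doc_expressions` maps a document id to a set of
    (expression, complexity) pairs. Expressions shared by more than
    `max_share` documents are dropped as presumably standard notation.
    """
    sharing = defaultdict(set)
    for doc_id, expressions in doc_expressions.items():
        for expr, complexity in expressions:
            if complexity >= min_complexity:
                sharing[expr].add(doc_id)
    return {expr: sorted(docs) for expr, docs in sharing.items()
            if 2 <= len(docs) <= max_share}
```

On a toy corpus, an expression appearing in exactly two documents is flagged, while one appearing in six documents is treated as conventional.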

**Semantic Taggers and Extraction Systems** We previously mentioned that semantic extraction systems [214, 329, 330] and semantic math taggers [71, 402] have difficulties extracting the essential components (MOIs) from complex expressions. Considering the definition of the Jacobi polynomial in Equation (3.2), it would be beneficial to extract the groups of tokens that belong together, such as *P*<sup>(*α,β*)</sup><sub>*n*</sub>(*x*) or Γ(*α* + *m* + 1). With our proposed search engine for retrieving MOIs, we are able to support semantic extraction systems and semantic math taggers. Imagine such a system being capable of identifying the term 'Jacobi polynomial' from the textual context. Figure 3.7 shows the top relevant hits for the search query 'Jacobi polynomial' retrieved from zbMATH and arXiv. The results contain several relevant and related expressions, such as the constraints *α, β >* −1 and the weight function of the Jacobi polynomial (1 − *x*)<sup>*α*</sup>(1 + *x*)<sup>*β*</sup>, which are essential properties of this orthogonal polynomial. Based on these retrieved MOIs, the extraction systems can adjust their retrieved math elements to improve precision, and semantic taggers or a tokenizer could re-organize parse trees to more closely resemble expression trees.

### **3.2.6 Outlook**

In this first study, we preserved the core structure of the MathML data, which provided insightful information for the MathML community. However, this makes it difficult to properly merge formulae. In future studies, we will normalize the MathML data via MathMLCan [117]. In addition to this normalization, we will include wildcards for investigating distributions of formula patterns rather than exact expressions. This will allow us to study connections between math objects, e.g., between Γ(*z*) and Γ(*x*+1). This would further improve our recommendation system and would allow for the identification of regions for parameters and variables in complex expressions.

### **3.3 Semantification with Textual Context Analysis**

The results of our math embedding experiments and the introduction of MOI motivate us to develop a context-sensitive LATEX to CAS translation approach around the MOI concept. In this section, we briefly discuss our novel approach to perform context-sensitive translations from LATEX to CAS, which concludes research task **II**. We focus on three main sources of semantic information to disambiguate mathematical expressions sufficiently for such translations:

1. the structural information encoded in the expression itself,
2. the textual context of the formula, and
3. common mathematical knowledge.
The first source is what most existing translators rely on by concluding the semantics from a given structure. The second source is rather broad. The necessary information can be given in the sentences before and after an equation, somewhere in the same article, or even through references (e.g., hyperlinks in Wikipedia articles or citations in scientific publications). In this thesis, we will focus on the textual context in a single document, i.e., we do not analyze references or deep links to other articles yet. The last source can be considered a backup option. If we cannot retrieve information from the context of a formula, the semantic meaning of a formula might be considered common knowledge, such as *π* referring to the mathematical constant.

We extract knowledge from each of the three sources with different approaches. For the inclusive structural information, we rely on the semantic LATEX macros developed by Miller [260] for the DLMF that define standard notation patterns for numerous OPSF. To analyze the textual context of a formula, we rely on the approach proposed by Schubotz et al. [330], who extracted noun phrases to enrich identifiers semantically. As a backup common knowledge database, we use the POM tagger developed by Youssef [402] that relies on manually crafted lexicon files with several common knowledge annotations for mathematical tokens.

### **3.3.1 Semantification, Translation & Evaluation Pipeline**

Figure 3.8 illustrates the pipeline of the proposed system to convert generic LATEX expressions to CAS. The figure contains numbered badges that represent the different steps in the system. Steps 1-4 represent the conversion pipeline, while steps 5-7 are different ways to evaluate the system.

Figure 3.8: Pipeline of the proposed context-sensitive conversion process. The pipeline consists of four semantification steps (1-4) and three evaluation approaches (5-7).

The conversion pipeline starts with *mathosphere*<sup>32</sup> (step **1a** ). Mathosphere is the Java framework developed by Schubotz et al. [279, 329, 330] in a sequence of publications to semantically enrich mathematical identifiers with defining phrases from the textual context. First, we will modify mathosphere so that it extracts MOI-definiens pairs rather than single identifiers (step **1b** ). For this purpose, we propose the following significant simplification: an isolated mathematical expression in a textual context is considered essential and informative. Hence, *isolated formulae* are defined as MOI. Moreover, mathosphere scores identifier-definiens pairs with regard to their first appearance in a document (since the first declaration of a symbol often remains valid throughout the rest of the document [394]). We adopt this scoring for MOI with a matching algorithm that allows us to identify MOI within other MOI in the same document (step **1c** ).
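The matching in step 1c can be illustrated with a simplified stand-in: treating each MOI as a tuple of tokens, we look for MOIs that occur as contiguous token subsequences of other MOIs in the same document. The function name and token streams are illustrative, not mathosphere's actual API.

```python
def find_nested_mois(mois):
    """For every MOI (a tuple of tokens), list the other MOIs of the
    same document that contain it as a contiguous token subsequence."""
    def contains(outer, inner):
        m = len(inner)
        return any(outer[i:i + m] == inner
                   for i in range(len(outer) - m + 1))

    nested = {}
    for inner in mois:
        hosts = [outer for outer in mois
                 if outer != inner and contains(outer, inner)]
        if hosts:
            nested[inner] = hosts
    return nested
```

For example, the MOI `P _ n` is recognized inside the larger MOI `P _ n ( x )`, so a definiens attached to the former can be propagated to the latter.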

Step **2** is currently optional and combines the results of the MOI-definiens extraction process with the common knowledge database of the POM tagger. The information can then be used to feed existing LATEX to MathML converters with additional semantic information. In Chapter 2, we created a MathML benchmark, called MathMLben, to evaluate such converters. We have also shown that, for example, LATExml can adopt additional semantic information via given semantic macros. Hence, via step **4** (and subsequently step **5** ) we can evaluate our semantification so far with the help of existing converters. Steps **2** , **4** , and **5** are not the subject of this thesis but part of upcoming projects.

<sup>32</sup>https://github.com/ag-gipp/mathosphere [accessed 03-24-2020]

Besides this optional evaluation over MathMLben, we continue our main translation path. Once we have extracted the MOI-definiens pairs, we replace the generic LATEX expressions with their semantic counterparts (step **3** ). We do so by indexing semantic LATEX macros so that we can search for them with textual queries. Afterward, we are able to retrieve semantic LATEX macros via the previously extracted definiens. Finally, we create replacement patterns so that the generic LATEX expression can be replaced with the semantically enriched macros from the DLMF. The result should be semantic LATEX, which enables another evaluation method. If we perform this pipeline on the DLMF, we can compare the generated semantic LATEX with the original, manually crafted semantic LATEX source in the DLMF to validate its correctness (step **6** ). Unfortunately, the entire pipeline focuses on the textual context, and the DLMF does not provide sophisticated textual information because its semantic information is available via special infoboxes, through hyperlinks, or in tables and graphs. A more comprehensive evaluation approach can be enabled by further translating the expressions to the syntax of CAS via LACAST, as we have shown in previous projects [2] (step **7** ), namely symbolic and numeric evaluations. This evaluation is the most desirable since it evaluates the entire proposed translation pipeline, from the semantification via mathosphere and the semantic LATEX macros to the final translation via LACAST. The next chapter will aim to realize this proposed pipeline. Steps **1** and **3** are discussed in Chapter 4. Step **7** is the subject of Chapter 5. Step **6** has not been realized due to the reduced amount of textual context within the DLMF. Steps **2** , **4** , and **5** are subject of future work.

This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

*The frst time someone calls you a horse you punch him on the nose, the second time someone calls you a horse you call him a jerk but the third time someone calls you a horse, well then perhaps it's time to go shopping for a saddle.*

Shlomo - *Lucky Number Slevin*

### **CHAPTER 4**

### **From LaTeX to Computer Algebra Systems**

### **Contents**


This chapter addresses research tasks **III** and **IV**, i.e., implementing a system for automated semantification and translation of mathematical expressions to CAS syntax. In the previous chapter, we laid the foundation for a novel context-sensitive semantification approach that extracts semantic information from the textual context and semantically enriches a formula with semantic LATEX macros. In this chapter, we realize this proposed semantification approach on 104 English Wikipedia articles with 6*,*337 mathematical expressions. However, before we continue with this main track, we first apply a novel context-agnostic machine translation approach for translations from LATEX to Mathematica.

Previously, we have seen that rule-based translators are rather limited, mostly because their rules are carefully selected and manually crafted. This manual approach makes it difficult to estimate the level of semantics that can be concluded directly from an expression (due to its structure, notation style, or the included symbols). Finding patterns in large data is a classic task for ML solutions. Hence, we will first evaluate the effectiveness of a machine translation approach in Section 4.1. We will see that the machine translation approach is very effective in adopting the notation style generated by Mathematica's LATEX exports but fails to generalize the

**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-40473-4\_4.

© The Author(s) 2023 A. Greiner-Petter, *Making Presentation Math Computable*, https://doi.org/10.1007/978-3-658-40473-4\_4

trained patterns to real-world scenarios or other libraries. A qualitative evaluation of the same model on the DLMF underlines the inappropriateness of the approach for a general translator. Nonetheless, the model still outperforms Mathematica's internal LATEX import function.

The machine translation approach presented in Section 4.1 partially contains excerpts of our<sup>1</sup> upcoming submission to the ACL Conference 2023. Section 4.2 has been accepted for publication in the upcoming issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) journal [11]. In order to provide a coherent story line, Section 4.2 only presents the first half of the TPAMI submission. The second half, the evaluation and discussion sections, subsequently continues in Chapter 5.

### **4.1 Context-Agnostic Neural Machine Translation**

Mathematical formulae are generally longer than natural language sentences. For example, 98% of the sentences in the Stanford Natural Language Inference (SNLI) entailment task contain less than 25 words [48]. In contrast, the average number of Mathematica tokens in the Mathematical Functions Site (MFS) dataset is 173. Very short and very long expressions are relatively rare but cover a wider range than natural language sentences, e.g., 2*.*25% contain less than 25 tokens and 2*.*1% contain more than 1*,*024 tokens. Meanwhile, the vocabulary of such a mathematical language contains only 1*k* tokens, compared to 60*k* tokens for a news classification model [410]<sup>2</sup>.

The most common neural machine translation models are sequence-to-sequence recurrent neural networks [355], tree-structured recursive neural networks [136], transformer sequence-to-sequence networks [371], and convolutional sequence-to-sequence networks [130]. For natural language translation tasks, transformer networks are known to outperform the others [130, 277, 371]. In this section, we use convolutional sequence-to-sequence networks [130] since they perform better on our mathematical language. With regard to related work, only a few approaches for mathematical language translations exist [95, 219, 275, 296, 373, 375, 376, 379].

### **4.1.1 Training Datasets & Preprocessing**

We used two datasets for our experiments: the Mathematical Functions Site (MFS)<sup>3</sup> and parts of the DLMF. For the MFS, we fetched all formulae in Mathematica's InputForm<sup>4</sup> and exported every expression with Mathematica's internal TeXForm<sup>5</sup> export function. This process generated 307*,*409 expression pairs in LATEX and Mathematica notation. We did the same for the DLMF dataset but used LATExml for the conversion from semantic LATEX to LATEX. From the DLMF, we generated 11*,*605 pairs in LATEX and semantic LATEX notation. Note that LATExml and Mathematica's TeXForm are rule-based translators. Hence, the generated data is limited to the abilities of the methods we used. Finally, we parsed the data into binary trees in postfix notation with the help of a custom rule-based tokenizer for LATEX and Mathematica expressions.
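A rule-based LaTeX tokenizer of the kind mentioned above can be sketched with a single prioritized regular expression. The token classes below are simplified assumptions, not the actual tokenizer used in the experiments.

```python
import re

# Alternatives are tried in order: control sequences first, then
# multi-digit numbers, single-letter identifiers, and structural symbols.
TOKEN_PATTERN = re.compile(
    r"\\[A-Za-z]+"           # control sequences such as \frac or \alpha
    r"|\d+"                  # (multi-digit) numbers
    r"|[A-Za-z]"             # single-letter identifiers
    r"|[{}^_+\-*/=(),\[\]]"  # structural symbols and operators
)

def tokenize_latex(expression):
    """Split a LaTeX string into a flat token stream (whitespace is skipped)."""
    return TOKEN_PATTERN.findall(expression)
```

For instance, `\frac{1}{2} + x^{12}` becomes a stream of 13 tokens, with `12` kept as a single multi-digit number token, which matters for the placeholder substitution described in Section 4.1.2.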

<sup>1</sup> The translator is a project by Felix Petersen and was supervised by Moritz Schubotz and assisted by me. In particular, I evaluated the model on the DLMF dataset and helped to devise the final paper for publication. The section has been mostly rewritten and shortened to the main findings to avoid conflicts.

<sup>2</sup> Even though more recent discussions argue that such large vocabularies are often not required and can be significantly reduced in size without a dramatic decrease in the model's performance [70].

<sup>3</sup> http://functions.wolfram.com/ [accessed 2021-09-20]

<sup>4</sup> https://reference.wolfram.com/language/ref/InputForm.html [accessed 2021-09-20]

<sup>5</sup> https://reference.wolfram.com/language/ref/TeXForm.html [accessed 2021-09-20]

### **4.1.2 Methodology**

Besides our final convolutional sequence-to-sequence model [130], we also experimented with Long Short-Term Memory (LSTM) recurrent networks [369]; recurrent, recursive, and transformer neural networks [130, 277, 371]; and *LightConv* [397] as an alternative to the classic convolutional sequence-to-sequence models [130]. However, our model outperformed all other approaches. In the following, we list the hyperparameters and additional design choices that performed best in our experiments.

We use


Since the MFS dataset contains more than 10<sup>4</sup> multi-digit numbers (in contrast to less than 10<sup>3</sup> non-numerical tags), these numbers cannot be interpreted as conventional tags. Thus, numbers are either split into single digits or replaced by variable tags. Splitting numbers into single digits causes significantly longer token streams, which degrades performance. Substituting all multi-digit numbers with tags like <number\_01> improved the exact match accuracy on the validation data set from 92*.*7% to 95*.*0%. We use a total of 32 such placeholder tags, as more than 99% of the formulae contain at most 32 multi-digit numbers. We randomly select the tags with which we substitute the numbers. Since multi-digit numbers essentially always correspond perfectly between the different mathematical languages, we directly replace the tags with their corresponding numbers after the translation. Lastly, we split the MFS dataset into 97% training data, 0*.*5% validation data, and 2*.*5% test data, and split the semantic LATEX data set into 90% training data, 5% validation data, and 5% test data since this set is smaller.
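The masking and back-substitution of multi-digit numbers can be sketched as follows. The function names are illustrative, and a fixed random seed stands in for the random tag selection described above.

```python
import random
import re

PLACEHOLDER_TAGS = [f"<number_{i:02d}>" for i in range(1, 33)]  # 32 tags

def mask_numbers(tokens, rng=None):
    """Replace multi-digit number tokens with randomly chosen placeholder
    tags and return the mapping needed to restore them after translation."""
    rng = rng or random.Random(0)
    unique = list(dict.fromkeys(t for t in tokens
                                if re.fullmatch(r"\d{2,}", t)))
    assert len(unique) <= len(PLACEHOLDER_TAGS), "more than 32 numbers"
    tags = rng.sample(PLACEHOLDER_TAGS, k=len(unique))
    to_tag = dict(zip(unique, tags))
    masked = [to_tag.get(t, t) for t in tokens]
    return masked, {tag: num for num, tag in to_tag.items()}

def unmask_numbers(tokens, restore):
    """Back-substitute placeholder tags with their original numbers."""
    return [restore.get(t, t) for t in tokens]
```

Since numbers correspond one-to-one across the two languages, applying `unmask_numbers` to a translated token stream restores the original numbers exactly.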

### **4.1.3 Evaluation of the Convolutional Network**

In the following, we use three evaluation metrics: Exact Matches (EM), the Levenshtein distance [227], and the Bilingual Evaluation Understudy (BLEU) [282]. The EM and the Levenshtein distance are calculated by comparing the sequences of Mathematica and TEX tokens. Hence, even equivalent LATEX expressions, such as **E=mcˆ2** and **E=mcˆ{2}**, are not considered an EM. Due to the two additional curly brackets, the Levenshtein distance between both expressions is 2. We further denote the share of translations that have a Levenshtein distance of at most <sup>5</sup> by LD≤5, and denote the average Levenshtein distance by LD.
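The two token-level metrics can be sketched directly; the standard dynamic-programming formulation of the Levenshtein distance reproduces the distance of 2 for the bracket example above.

```python
def exact_match(prediction, reference):
    """EM: the token sequences must be identical."""
    return prediction == reference

def levenshtein(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (token_a != token_b)  # substitution
            ))
        previous = current
    return previous[-1]
```

Comparing `E=mc^2` and `E=mc^{2}` character by character yields no EM and a distance of 2, exactly the two inserted curly brackets.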

The BLEU score is a quality measure that compares the machine's output to a translation by a professional human translator. It compares the *n*-grams (specifically *n* = 1*,...,* 4) between the prediction and the ground truth. Since the translations in the data sets are ground truth values instead of human translations, for the back-translation of formulae, this metric reflects the closeness to the ground truth. BLEU scores range from 0 to 100, with the higher value


indicating the better result. For comparison with natural languages, state-of-the-art translators reach a BLEU score of 35*.*0 from English to German and of 45*.*6 from English to French [102]. That the BLEU scores for formula translations are significantly higher than the scores for natural language can be attributed to the larger vocabularies of natural language and the considerably higher variability between correct translations.
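A minimal sentence-level BLEU computation (uniform weights for *n* = 1 to 4, with brevity penalty and without smoothing) can be sketched as follows; real evaluations use mature implementations such as sacreBLEU, so this sketch only illustrates the mechanics.

```python
import math
from collections import Counter

def sentence_bleu(prediction, reference, max_n=4):
    """Simplified sentence-level BLEU on token lists; returns [0, 100].
    A zero n-gram overlap yields 0 (no smoothing applied)."""
    precisions = []
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(prediction[i:i + n])
                              for i in range(len(prediction) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        overlap = sum((pred_ngrams & ref_ngrams).values())  # clipped counts
        if overlap == 0:
            return 0.0
        precisions.append(overlap / sum(pred_ngrams.values()))
    brevity = (1.0 if len(prediction) >= len(reference)
               else math.exp(1 - len(reference) / len(prediction)))
    return 100.0 * brevity * math.exp(
        sum(math.log(p) for p in precisions) / max_n)
```

An exact token match scores 100, a disjoint prediction scores 0, and near matches such as the bracket variants above fall strictly in between.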

In addition to this, we also performed round-trip experiments from LATEX into Mathematica and back again on the im2latex-100k<sup>6</sup> dataset [95]. This dataset consists of about 100*k* formulae from papers on arXiv, including their renderings. The im2latex-100k task was originally conceived for the conversion of mathematical formulae from images into LATEX via OCR. We use it as an additional source of more general mathematical expressions instead. For our round-trip experiment, we translate all LATEX expressions into Mathematica with the internal LATEX import function and with our convolutional sequence-to-sequence model. Afterward, we use Mathematica's export function to generate LATEX again. Finally, we compare this round-trip translated LATEX with the original input formula. Note that 66*.*8% of the equations in the im2latex-100k data set contain tokens that are not in our model's vocabulary.

### **4.1.3.1 Results**

Table 4.1 shows the results of our convolutional sequence-to-sequence model for translations to Mathematica and semantic LATEX, evaluated with the EM rate and the BLEU score. We achieved an EM accuracy of 95*.*1% and a BLEU score of 99*.*68 for translations to Mathematica. For the translation from LATEX to semantic LATEX, we achieved an EM accuracy of 90*.*7% and a BLEU score of 96*.*79. In Table 4.2, we compare our model with Mathematica's internal TEX import function on the two datasets MFS and im2latex-100k. While the accuracy drops on a new dataset, our model still outperforms Mathematica's import function on all metrics. Lastly, for a more qualitative analysis, we evaluated our model manually on 100 random samples of DLMF formulae, i.e., we did not check the EM or BLEU score, but a human annotator manually checked whether a translation was correct or at least syntactically valid (which is the same as the previously used Import metric). All 100 samples and the results are available in Table E.1 in Appendix E.1, available in the electronic supplementary material. Table 4.3 shows the comparison of our model with Mathematica's import function and our previously developed translator LACAST [13]. As we can see, on these random samples, Mathematica outperforms our model but LACAST performs best. Nonetheless, LACAST was specifically designed for translations on the DLMF, which allows LACAST to correctly anticipate the usage of constants, such as i for the imaginary unit or *e* for Euler's number.


6 https://paperswithcode.com/dataset/im2latex-100k [accessed 2021-09-21]


Table 4.2: Comparison between Mathematica and our model on backward translation of the formulae of the MFS and im2latex-100k dataset. Import denotes the fraction of formulae that can be imported by Mathematica, i.e., the translation was syntactically valid.

Table 4.3: Qualitative comparison between Mathematica, LACAST, and our model on 100 random DLMF samples. ✘ indicate wrong translations. ✔ indicate correct translations. As in Table 4.2, Import denotes syntactically valid translations. The full dataset is available in Appendix E.1 available in the electronic supplementary material.


### **4.1.3.2 Qualitative Analysis and Discussion**

We observe that our model successfully outperforms Mathematica in various scenarios. A good example is the following equation<sup>7</sup>:

$$\wp\left(z;g\_2,g\_3\right) = -\frac{\sigma\left(z-z\_0;g\_2,g\_3\right)\sigma\left(z+z\_0;g\_2,g\_3\right)}{\sigma\left(z;g\_2,g\_3\right)^2\sigma\left(z\_0;g\_2,g\_3\right)^2}, \quad z\_0 = \wp^{-1}\left(0;g\_2,g\_3\right). \tag{4.1}$$

The symbol *℘* (\wp) is properly interpreted by both the model and Mathematica as the Weierstrass elliptic function *℘* (WeierstrassP). That is because the symbol *℘* is uniquely tied to the Weierstrass *℘* function. The inverse of this function, *℘*<sup>−1</sup>, is also properly interpreted by both systems as InverseWeierstrassP. However, *σ* was not properly interpreted by Mathematica as WeierstrassSigma, presumably due to the ambiguity of *σ*. Considering that the expression is from the MFS and *℘* appears in the same expression, we can conclude that *σ* refers to WeierstrassSigma. Our model was able to capture this connection and correctly translate the entire expression.

The low scores of Mathematica on its own dataset can be attributed to the fact that Mathematica does not attempt to disambiguate its own exported expressions. As we discussed earlier, an export from a computational language to a presentation language loses semantic information. Our sequence-to-sequence model was able to restore the semantic information under the assumption that the input was generated from the MFS via Mathematica. Hence, our model performs very well on the trained data but is unable to produce reliable translations on


<sup>7</sup> Extracted from https://functions.wolfram.com/EllipticFunctions/WeierstrassP/introductions/Weierstrass/04/ [accessed 2021-09-14]



unseen, more general expressions. A first hint at this problem can be found in Table 4.3 for our evaluation on the 100 DLMF formulae. While our model clearly outperforms Mathematica on the MFS dataset, the internal rule-based import function of Mathematica works more reliably on unknown expressions. One reason for the low performance of our model on the DLMF evaluation is our vocabulary: 71 of the 100 expressions contain tokens that are not in the Mathematica-export vocabulary. Hence, our model was unable to correctly interpret these expressions. This clearly underlines the limitation of the model. As an approach to mitigate this effect in the future, we could use multilingual translations [40, 174], which would allow learning translations and tokens that are not represented in the training data for the respective language pair.

Additionally, we must note that every dataset we used has a significant bias. The DLMF and MFS specifically focus on OPSF. The im2latex-100k dataset was created from arXiv articles in the area of high energy physics<sup>8</sup>. A general limitation of neural networks is that trained models inherit biases from their training data. For a successful formula translation, this means that the set of symbols, as well as the style in which the formulae are written, has to be present in the training data. Rather than learning the actual semantics of an expression, a model captures the notation flavor or convention another tool produces, such as Mathematica's export function or LATExml. The LATEX generated by both Mathematica and LATExml is limited to a specific vocabulary and does not allow variation, as it is produced by rule-based translators.

<sup>8</sup> *Phenomenology* (hep-ph) and *Theory* (hep-th) specifcally.

Because of the limited vocabularies as well as the limited set of LATEX conventions in the data sets, the translation of mathematical LATEX expressions of different flavors is not possible.

Due to the performance on the MFS and im2latex-100k datasets, we conclude that our model captures more patterns than Mathematica's internal import methods. On the other hand, we have also shown that our model is unable to capture the semantic information of mathematical expressions but concludes semantics from patterns and token structures. Whether these semantics are correct or consistent with additional contextual information does not matter to the model. Hence, our translation is rather unpredictable and susceptible to minor visual changes in the inputs. If we consider the simple examples from Table 1.2 in the introduction, we can see that our model, similar to Mathematica, is unable to correctly translate most expressions. Table 4.4 shows the translations of our model. Three of the translations even contain obvious syntax errors, such as unbalanced brackets. In comparison to Table 1.2, we added three more examples to show that marginal changes may have a significant impact on the final translation. For example, simply changing the variable of integration from *x* to *t* in the first example changes the outcome from a syntactically and semantically invalid expression to a correct and valid translation. Similarly, additional curly brackets around the limits of an integral may cause a wrong translation and an error that can be difficult to trace back if not immediately noticed<sup>9</sup>.

Considering the simplicity of the expressions, a machine translation model alone might not be the correct approach for a reliable LATEX to CAS translator, especially because such simple mistakes harm the trustworthiness of the entire engine. Since accuracy and precision are among the most important aspects in mathematics, our machine translator cannot be considered competitive with existing rule-based approaches. A hybrid solution with ML-enhanced pattern recognition techniques and rule-based translations could be the more promising solution in the future.

### **4.2 Context-Sensitive Translation**

Since the previous section has shown that machine translations are not as reliable as rule-based approaches, we continue to develop a more reliable strategy following heuristics that have been developed over time by studying mathematical notations. Specifically, we want to focus on a broader source of mathematical expressions, away from the strict notation guidelines of the DLMF and the less descriptive scientific articles on arXiv. In the following, we will focus on Wikipedia articles as our primary source of mathematical expressions.

### **4.2.1 Motivation**

Like many other knowledge base systems, Wikipedia encodes mathematical formulae in a representational format similar to LATEX [156, 17, 405]. While this representational format is simple to comprehend for readers possessing the required mathematical training, additional explicit knowledge of the semantics associated with each expression in a given formula could make the mathematical content in Wikipedia even more explainable, unambiguous, and, most importantly, machine-readable. Additionally, making math machine-readable can allow visually impaired individuals to receive a semantic description of the mathematical content. Finally, and crucially, moderating and curating mathematical content in a free and community-driven

<sup>9</sup> Here the variable of integration switched from *x* to *a* in the translated expression due to the redundant curly brackets around the limits of the integral. This error can be easily overlooked.


Figure 4.1: Mathematical semantic annotation in Wikipedia.

encyclopedia like Wikipedia is more time-consuming and error-prone without explicit access to the semantics of a formula. Wikipedia currently uses the *Objective Revision Evaluation Service* (ORES) to predict the damaging or good-faith intention of an edit using multiple independent classifiers trained on different datasets [144]. The primary motivation behind ORES was to reduce the overwhelming workload of content moderation with machine learning classification solutions. Until now, the ORES system applies no special treatment to mathematical content. Estimating the trustworthiness of an edit in a mathematical expression is significantly more challenging for human curators and almost infeasible for Artificial Intelligence (AI) classification models due to the complex nature of mathematics.

In this section, we propose a semantification and translation pipeline that makes the math in Wikipedia computable via CAS. CAS, such as Maple [36] and Mathematica [393], are complex mathematical software tools that allow users to manipulate, simplify, plot, and evaluate mathematical expressions. Hence, translating mathematics in Wikipedia to CAS syntaxes enables automatic verification checks on complex mathematical equations [2, 11]. Integrating such verifications into the existing ORES system can significantly reduce the workload of moderating mathematical content and increase credibility in the quality of Wikipedia articles at the same time [359]. Since such a translation is context-sensitive, we also propose a semantification approach for the mathematical content. This semantification uses semantic LATEX macros [260] from the DLMF [98] and noun phrases from the textual context to semantically annotate math formulae. The semantic encoding in the DLMF provides additional information about the components of a formula, the domain, constraints, and links to definitions, and improves the searchability and discoverability of the mathematical content [260, 403]. Our semantification approach enables the features of the DLMF for mathematics in Wikipedia. Figure 4.1 provides an example vision of our semantic annotations and verification results in Wikipedia [17]. Head et al. [150] recently showed that providing readers with information on the individual elements of mathematical expressions on-site [329, 394], as shown in Figure 4.1, can significantly help users of all experience levels to read and comprehend articles more efficiently [150].

Mathematics is not a formal language. Its interpretation heavily depends on the context, e.g., *π*(*x* + *y*)<sup>10</sup> can be interpreted as the multiplication *πx* + *πy* or as the number of primes less than or equal to *x*+*y*. CAS syntaxes, on the other hand, are unambiguous content languages. Therefore, the main challenge to enable CAS verifications for mathematical formulae in Wikipedia is a

<sup>10</sup>In the following, we use this color coding for examples to easily distinguish them from other mathematical content in this section.

reliable translation between an ambiguous, context-dependent format and an unambiguous, context-free CAS syntax. Hence, we derive the following research question:

### **Research Question**

What information is required to translate mathematical formulae from natural language contexts to CAS and how can this information be extracted?

In this section, we present the first context-dependent translation from mathematical LaTeX expressions to CAS, specifically Maple and Mathematica. We show that a combination of nearby context analysis (extraction of descriptive terms) and a list of standard notations for common functions provides sufficient semantic information to outperform existing context-independent translation techniques, such as CAS-internal LaTeX import functions. We achieve reliable translations in a four-step augmentation pipeline. These steps are: (1) pre-processing Wikipedia articles to enable natural language processing on them, (2) constructing an annotated mathematical dependency graph, (3) generating semantically enhancing replacement patterns, and (4) performing CAS-specific translations (see Figure 4.2). In addition, we perform automatic symbolic and numeric computations on the translated expressions to verify equations from Wikipedia articles [2, 11]. We show that the system is capable of detecting potential errors in mathematical equations in Wikipedia articles. Future releases could be integrated into the ORES system to reduce vandalism and improve trust in mathematical articles in Wikipedia. We demonstrate the feasibility of the translation approach on English Wikipedia articles and provide access to an interactive demo of our *LaTeX to CAS translator* (LACAST)<sup>11</sup>.

For the evaluation of the translations, we focus on the sub-domain of OPSF. OPSF are generally well-supported by general-purpose CAS [13], which allows us to estimate the full potential of our proposed translation and verification pipeline. Since CAS syntaxes are programming languages, one has the option to add new functionality to a CAS, such as defining a new function. Defining new functions in CAS, however, can vary significantly in complexity. While translating a generic function like *f*(*x*) := *x*<sup>2</sup> is straightforward, defining the prime counting function from above could be very complex. If a function is explicitly declared in the CAS, we call a translation to that function *direct*. General mathematics often does not have such direct translations. For example, translating the generic function *f*(*x*) is meaningless without considering the actual definition of *f*(*x*). Hence, we first focus on translations of OPSF, which often have direct translations to CAS. In addition, OPSF are highly interconnected, i.e., many OPSF can be expressed (or even defined) in terms of other OPSF. One of the main tasks for our future work is to support more non-direct translations, enabling LACAST to handle more general mathematics.

In this section, we present our pipeline and discuss each of the augmentation steps. Section 4.2.2 discusses related work. In Section 4.2.3, we introduce a formal definition for translating LaTeX to CAS syntaxes. Section 4.2.4 explains necessary pre-processing steps for Wikipedia articles. Section 4.2.5 introduces our annotated dependency graph. Section 4.2.6 concludes with the replacement of generic LaTeX subexpressions by semantically enriched macros from the DLMF. The evaluation and discussion subsequently continue in Chapter 5.

<sup>11</sup>https://tpami.wmflabs.org [accessed 2021-09-01]

### **4.2.2 Related Work**

Our proposed pipeline touches on several well-known tasks from MathIR, namely descriptive entity recognition for mathematical expressions [183, 213, 279, 320, 329], math tokenization [402], math dependency recognition [14, 214], and automatic verification [2, 11]. Existing approaches that translate mathematical formulae from presentational languages, e.g., LaTeX or MathML, to content languages, e.g., content MathML or CAS syntax, do not analyze the context of a formula [14, 270, 18]. Hence, existing approaches to translate LaTeX to CAS syntaxes are limited to simple arithmetic expressions [18] or require manual semantic annotations [14]. Some CAS, such as Mathematica, support LaTeX imports. Those functions fall into the first category [18] and are limited to rather simple expressions. A semantic annotation, on the other hand, can be directly encoded in LaTeX via macros and allows for translations of more complex formulae. Miller et al. [260] developed a set of the previously mentioned semantic macros that link specific mathematical expressions with definitions in the DLMF [98]. The manually generated semantic data from the DLMF [403] was successfully translated to and evaluated by CAS with our proposed framework LACAST [2, 13]. Therefore, our translation pipeline contains two steps: first, the semantic enhancement process towards the *semantic* LaTeX dialect used by the DLMF; second, the translation from semantic LaTeX to CAS via LACAST. In this section, we focus on the first step. The second phase is largely covered by [2, 11, 13]. A more comprehensive overview was given in Section 2.4.

### **4.2.3 Formal Mathematical Language Translations**

First, we will introduce an abstract formalized concept for our translation approach, followed by a detailed technical explanation of our system. Inspired by the pattern-matching translation approaches in compilers [263], we introduce a translation on mathematical expressions as a sequence of tree transformations. In the following, we mainly distinguish between two kinds of mathematical languages: presentational languages L*<sub>P</sub>*, such as LaTeX<sup>12</sup> or presentation MathML<sup>13</sup>, and content languages L*<sub>C</sub>*, such as content MathML, OpenMath [204], or CAS syntaxes [36, 393]. Elements of these languages are often referred to as symbol layout trees for *e* ∈ L*<sub>P</sub>* or operator trees for *e* ∈ L*<sub>C</sub>* [92]. We call a context-dependent translation t : L*<sub>P</sub>* × *X* → L*<sub>C</sub>* with (*e, X*) ↦ t(*e, X*) *appropriate* if the intended semantic meaning of *e* ∈ L*<sub>P</sub>* is the same as that of t(*e, X*) ∈ L*<sub>C</sub>*. We further define the context *X* of an expression *e* as a set of facts from the document D the expression *e* appears in and a set of common knowledge facts K, so that facts from the document may overwrite facts from the common knowledge set

$$X := \{ f | f \in \mathcal{D} \cup \mathcal{K} \land (f \in \mathcal{K} \Rightarrow f \notin \mathcal{D}) \}. \tag{4.2}$$

A fact *f* is a tuple (MOI*,* MC) of a Mathematical Object of Interest (MOI) [14] and a Mathematical Concept (MC). An MOI *m* refers to a meaningful mathematical object in a document, and the MC uniquely defines the semantics of that MOI. In particular, from the MC of an MOI *m*, we derive a semantically enhanced version *m*′ of *m* so that *m*′ ∈ L*<sub>C</sub>*. Hence, from *f*, we derive a graph transformation rule *r<sub>f</sub>* = *m* → *m*′ and define *g<sub>f</sub>*(*e*) as the application *e* ⇒<sub>*r<sub>f</sub>*</sub> *e*˜ with *e* ∈ L*<sub>P</sub>*, *e*˜ ∈ L*<sub>C</sub>*.
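The role of Eq. (4.2), where document facts overwrite common-knowledge facts for the same MOI, can be sketched in a few lines of Python. Facts are (MOI, MC) tuples as defined above; all concrete MOI and MC strings, and the function name, are invented for illustration:

```python
# Sketch of the context X from Eq. (4.2): facts are (MOI, MC) tuples, and a
# document fact for an MOI overrides the common-knowledge fact for the same
# MOI. All concrete MOI and MC strings below are illustrative.

def build_context(document_facts, common_knowledge):
    context = {moi: mc for moi, mc in common_knowledge}
    context.update({moi: mc for moi, mc in document_facts})  # D wins over K
    return {(moi, mc) for moi, mc in context.items()}

K = [(r"\pi(x)", "circle constant times x")]    # common-knowledge reading
D = [(r"\pi(x)", "prime counting function")]    # derived from the article
X = build_context(D, K)
```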

We split the translation t(*e, X*) into two steps, a semantification step t*<sub>s</sub>*(*e, X*) and a mapping step t*<sub>m</sub>*(*e*). The semantification t*<sub>s</sub>*(*e, X*) transforms all subexpressions *e*¯ ⊆ *e* that are not operator

From LaTeX to Computer Algebra Systems

<sup>12</sup>https://www.latex-project.org/ [accessed 2021-06-29]

<sup>13</sup>https://www.w3.org/TR/MathML3/ [accessed 2021-06-29]

trees, i.e., *e*¯ ∈ L*<sub>P</sub>* \ L*<sub>C</sub>*, to operator tree representations *e*˜¯ ∈ L*<sub>C</sub>*. In the following, we presume that these subexpressions *e*¯ are MOI so that we can derive *e*˜¯ from a fact *f* ∈ *X*. Then we define the semantification step as the sequence of fact-based graph transformations

$$\mathrm{t}_s(e, X) := g_{f_1} \circ \cdots \circ g_{f_n}(e), \tag{4.3}$$

with *f<sub>k</sub>* ∈ *X*, *k* = 1*,...,n*. Again, we call a graph transformation *g*(*e*) *appropriate* if the intended semantics of the expression *e* and its transformation *g*(*e*) are the same. Further, we call t*<sub>s</sub>*(*e, X*) *complete* if all subexpressions *e*′ ⊆ t*<sub>s</sub>*(*e, X*) are in L*<sub>C</sub>* and *incomplete* otherwise. Note that graph transformations are not commutative, i.e., there could be *f*<sub>1</sub>*, f*<sub>2</sub> ∈ *X* so that *g<sub>f<sub>1</sub></sub>* ◦ *g<sub>f<sub>2</sub></sub>*(*e*) ≠ *g<sub>f<sub>2</sub></sub>* ◦ *g<sub>f<sub>1</sub></sub>*(*e*).
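A toy illustration of this non-commutativity, using flat string rewriting in place of LACAST's tree rewriting. The DLMF macro \nprimes@ denotes the prime counting function; \cpi stands for the constant *π*; the rewriting function is invented for illustration:

```python
# Non-commutativity of the fact-based rewrites in Eq. (4.3), shown with flat
# string rewriting instead of parse-tree rewriting. \nprimes@ is the DLMF
# macro for the prime counting function; \cpi stands for the constant pi.

def semantify(expr, rules):
    # Apply the ranked rules in order.
    for pattern, replacement in rules:
        expr = expr.replace(pattern, replacement)
    return expr

expr = r"\pi(x) + \pi"
good = semantify(expr, [(r"\pi(x)", r"\nprimes@{x}"), (r"\pi", r"\cpi")])
bad  = semantify(expr, [(r"\pi", r"\cpi"), (r"\pi(x)", r"\nprimes@{x}")])
# good -> '\nprimes@{x} + \cpi'   (function rule fired first)
# bad  -> '\cpi(x) + \cpi'        (constant rule fired too early)
```

Applying the constant rule first destroys the match for the function rule, which is exactly why the facts must be ranked before the transformations are applied.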

The mapping step t*<sub>m</sub>*(*e*) is a sequence of applications of graph transformation rules that replace a node (or subtree) with the codomain-specific syntax version of the node (or subtree). Hence, the mapping step is a context-independent translation t*<sub>m</sub>* : L*<sub>C1</sub>* → L*<sub>C2</sub>* with L*<sub>C1</sub>*, L*<sub>C2</sub>* ⊂ L*<sub>C</sub>* and a fixed rule set R*<sub>C1</sub><sup>C2</sup>* so that *r<sub>k</sub>* = L*<sub>C1</sub>* → L*<sub>C2</sub>* for *r<sub>k</sub>* ∈ R*<sub>C1</sub><sup>C2</sup>*, *k* = 1*,...,n*. Then we define

$$\mathrm{t}_m(e) := g_{r_1} \circ \cdots \circ g_{r_n}(e). \tag{4.4}$$

Note that t*<sub>m</sub>*(*e*) ignores subexpressions *e*¯ ⊆ *e* that are not in L*<sub>C</sub>*. For CAS languages L*<sub>M</sub>* ⊂ L*<sub>C</sub>*, certain subtrees *e*˜ ⊆ *e* ∈ L*<sub>P</sub>* of an expression are already operator trees in the target language, *e*˜ ∈ L*<sub>M</sub>*. Hence, we call t*<sub>m</sub>*(*e*) *complete* if all *e*′ ⊂ *e* with *e*′ ∈ L*<sub>C1</sub>* \ L*<sub>C2</sub>* were transformed to L*<sub>C2</sub>*. Note that a complete t*<sub>m</sub>*(*e*) is not necessarily appropriate because such an *e* ∈ L*<sub>P</sub>* ∩ L*<sub>C</sub>* could have a different semantic meaning in L*<sub>P</sub>* and L*<sub>C</sub>* (see the *π* example from the introduction).

### **Definition of a Context-Sensitive Translation Function**

For a given target CAS language L*<sub>M</sub>* ⊂ L*<sub>C</sub>*, a set of rules R*<sub>C</sub><sup>M</sup>*, and a context *X*, we define the two-step translation process as

$$\mathbf{t}: \mathcal{L}\_P \times X \to \mathcal{L}\_C \qquad \mathbf{t}(e, X) := \mathbf{t}\_m(\mathbf{t}\_s(e, X)). \tag{4.5}$$

We call t(*e, X*) *complete* if t*s*(*e, X*) and t*m*(*e*) are *complete* and *appropriate*.

Splitting the translation t(*e, X*) into these two steps has the advantage of modularity. Given an appropriate and complete semantification, we can translate an expression *e* to any content language L*<sub>M</sub>* ⊂ L*<sub>C</sub>* by using a different set of rules R*<sub>C</sub><sup>M</sup>* for t*<sub>m</sub>*(*e*). In previous research, we developed LACAST [3, 13] as an implementation of t*<sub>m</sub>*(*e*) between the content language *semantic* LaTeX [403] (the semantically enhanced LaTeX used in the DLMF) and the CAS syntaxes of Maple and Mathematica. Technically, semantic LaTeX is simply normal LaTeX in which specific subexpressions are replaced by semantically enhanced macros. In this section, we extend LACAST to identify the subexpressions that can be replaced with these semantic LaTeX macros. This semantification is our first translation step t*<sub>s</sub>*(*e, X*). The results of t*<sub>s</sub>*(*e, X*) are in semantic LaTeX, which is in L*<sub>C</sub>*. For the second step (the mapping), we rely on the original LACAST implementation (from semantic LaTeX to CAS syntaxes) for t*<sub>m</sub>*(*e*) and presume that t*<sub>m</sub>*(*e*) is complete and appropriate [2, 11].

To perform a complete and appropriate semantification, we need to solve three remaining issues. First, how can we derive sufficiently many facts *f* ∈ D from a document so that the transformation rules *r<sub>f</sub>* are appropriate and the semantification t*<sub>s</sub>*(*e, X*) is appropriate and

Figure 4.2: The workflow of our context-sensitive translation pipeline from LaTeX to CAS syntaxes.

complete? Second, since the transformation rules are not commutative, a different order of facts may result in an inappropriate semantification t*<sub>s</sub>*(*e, X*). Hence, we need to develop a fact ranking rk(*f*) so that the sequence of transformations is performed in an appropriate order. Third, how can we determine if a translation was appropriate and complete? There is no general solution available to determine the intended semantic information of an expression *e* ∈ L*<sub>P</sub>*. In turn, it is probably impossible to determine with certainty whether a translation is appropriate for general expressions. Therefore, we propose different evaluation approaches that allow automatically verifying the appropriateness and completeness of a translation. We performed the same evaluation approaches on the manually annotated semantic LaTeX sources of the DLMF and successfully identified errors in the DLMF and the two CAS Maple and Mathematica [2, 11]. Hence, we presume the same technique is appropriate to detect errors in Wikipedia too. In addition to these verification evaluations, we perform a manual evaluation on a smaller test set for a qualitative analysis.

The number of facts (transformation rules) that we derive from a document D is critical. A low number of transformation rules may result in an incomplete translation. On the other hand, too many transformation rules may increase the number of false positives and result in an inappropriate transformation. To solve this issue, we propose a dependency graph of mathematical expressions containing the MOI of a document as nodes. A dependency in this graph describes the subexpression relationship between two MOI. We further annotate each MOI with textual descriptions from the surrounding context. We interpret these descriptions as references to the mathematical concepts (MC) that define the MOI and rank each description according to distance and heuristic measures. Since MOI are often compositions of other MOI, the dependencies allow us to derive relevant facts for an expression *e* from its subexpressions *e*′ ⊆ *e*. To derive a semantically enhanced version *m*′ for an MOI *m*, we use the semantic macros from the DLMF. Each semantic macro is a semantically enhanced version *m*′ of a standard representation *m*. To derive relevant semantic macros, i.e., transformation rules, we search for a semantic macro whose description matches the MC of the facts. In turn, we obtain a large number of ranked facts with the same MOI *m* and a ranked list of transformation rules *r*<sub>1</sub>*,...,r<sub>n</sub>* for each fact *f*. The rankings allow us to control the number and order of the graph



transformations *g<sub>r<sub>f</sub></sub>*(*e*) in t*<sub>s</sub>*(*e, X*). In turn, the annotated dependency graph should solve the mentioned issues one and two. The pipeline is visualized in Figure 4.2. The rest of this section explains the pipeline in more detail. The third issue, i.e., determining the appropriateness and completeness of a translation, is discussed in Section 5.2 in Chapter 5.

### **4.2.3.1 Example of a Formal Translation**

Consider the example from the introduction, *π*(*x* + *y*), in a document D that describes *π*(*x*) as the prime counting function. Hence, we derive the fact

$$f = (\pi(x), \text{prime counting function}) \in \mathcal{D}. \tag{4.6}$$

In our dependency graph, *π*(*x* + *y*) depends on *π*(*x*). Hence, we derive the same fact *f* for *π*(*x*+*y*). Based on this fact, we find a function in the DLMF described as 'the number of primes not exceeding *x*' which uses the semantic macro \nprimes@{x} and the presentation *π*(*x*). Hence, we derive the transformation rule

$$r_f = \pi(v_1) \to \texttt{\textbackslash nprimes@\{}v_1\texttt{\}}, \tag{4.7}$$

where *v*<sub>1</sub> is a wildcard for variables. For simplicity, this example only derives a single transformation rule *r<sub>f</sub>* rather than an entire set of ranked rules and facts as described above. Our final pipeline will derive an entire list of ranked facts and replacement rules that are successively applied. LACAST defines a translation rule *r*<sub>1</sub> ∈ R*<sub>C</sub><sup>Mathematica</sup>* for this function to PrimePi[x] and a rule *r*<sub>2</sub> ∈ R*<sub>C</sub><sup>Maple</sup>* to pi(x) in Maple<sup>14</sup>, respectively. Hence, the translation to Mathematica would be performed via *r*<sub>1</sub> as

$$\mathrm{t}(\pi(x+y), X) = \mathrm{t}_m(\mathrm{t}_s(\pi(x+y), X)) \tag{4.8}$$

$$= g_{r_1}(g_f(\pi(x+y))) \tag{4.9}$$

$$= g_{r_1}(\texttt{\textbackslash nprimes@\{}x+y\texttt{\}}) \tag{4.10}$$

$$= \texttt{PrimePi[}x+y\texttt{]}. \tag{4.11}$$

For Maple, the translation process is performed via *r*<sup>2</sup> instead

$$\mathrm{t}(\pi(x+y), X) = \mathrm{t}_m(\mathrm{t}_s(\pi(x+y), X)) \tag{4.12}$$

$$= g_{r_2}(g_f(\pi(x+y))) \tag{4.13}$$

$$= g_{r_2}(\texttt{\textbackslash nprimes@\{}x+y\texttt{\}}) \tag{4.14}$$

$$= \texttt{pi(}x+y\texttt{)}. \tag{4.15}$$

This underlines the modular design of our translation pipeline. Further, LACAST takes care of additional requirements for successful translations. In this particular example, LACAST informs the user that the NumberTheory package must be loaded in Maple in order to use the translated expression pi(x+y). Note that the subexpression *x* + *y* was transformed neither by *g<sub>f</sub>*(*e*) nor by *g<sub>r<sub>1</sub></sub>*(*e*), because *x* + *y* ∈ L*<sub>M</sub>* ∩ L*<sub>P</sub>*. Hence, this translation is complete and appropriate.
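The two-step translation of this example can be mimicked with two regex rewrites. This is a deliberately tiny sketch of t = t<sub>m</sub> ∘ t<sub>s</sub> for the single rule *r<sub>f</sub>*; the regexes and function names are illustrative and stand in for LACAST's parse-tree transformations:

```python
import re

# Toy version of Eqs. (4.8)-(4.15): t_s rewrites the presentational \pi(...)
# to the semantic DLMF macro \nprimes@{...}; t_m maps the macro to a CAS.
# String rewriting stands in for LACAST's parse-tree transformations.

def t_s(expr):
    # r_f: \pi(v1) -> \nprimes@{v1}
    return re.sub(r"\\pi\(([^)]*)\)", r"\\nprimes@{\1}", expr)

def t_m(expr, cas):
    # r_1 (Mathematica) and r_2 (Maple), mirroring the rules above
    rules = {
        "Mathematica": (r"\\nprimes@\{([^}]*)\}", r"PrimePi[\1]"),
        "Maple":       (r"\\nprimes@\{([^}]*)\}", r"pi(\1)"),
    }
    pattern, replacement = rules[cas]
    return re.sub(pattern, replacement, expr)

def translate(expr, cas):
    return t_m(t_s(expr), cas)

translate(r"\pi(x+y)", "Mathematica")   # -> 'PrimePi[x+y]'
translate(r"\pi(x+y)", "Maple")         # -> 'pi(x+y)'
```

Swapping the target CAS only swaps the mapping rule set, leaving the semantification untouched, which is the modularity argument made above.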

<sup>14</sup>Maple requires pre-loading the *NumberTheory* package.

### **4.2.4 Document Pre-Processing**

For extracting the facts from a document D, we need to identify all MOI and MC. In previous research [329], we have shown that noun phrases can represent the definiens of identifiers. Hence, we presume noun phrases are good candidates for MC too. To properly extract noun phrases, we use CoreNLP [240] as our POS tagger [367, 368]. Since CoreNLP is unable to parse mathematics, we first replace all math by placeholders. In a previous project [279], we proposed a Mathematical Language Processor (MLP) that replaces mathematical expressions with placeholders. Occasionally, this approach yields wrong annotations. For example, CoreNLP may tag *factorial* or *polynomial* as adjectives when a math token follows, even in cases where they are clearly naming mathematical objects<sup>15</sup>. However, the MLP approach works reasonably well in most cases.
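The masking step can be sketched as follows. The MATH_k placeholder convention matches the example in footnote 15; the `<math>`-tag regex and the function name are illustrative simplifications of MLP:

```python
import re

# Replace <math>...</math> spans by MATH_k placeholders so that a POS tagger
# such as CoreNLP only sees natural language. Simplified stand-in for MLP.

def mask_math(text):
    formulae = []
    def repl(match):
        formulae.append(match.group(1))
        return f"MATH_{len(formulae)}"
    masked = re.sub(r"<math>(.*?)</math>", repl, text, flags=re.DOTALL)
    return masked, formulae

sentence = ("The Jacobi polynomial "
            "<math>P_n^{(\\alpha,\\beta)}(x)</math> is an orthogonal polynomial.")
masked, formulae = mask_math(sentence)
# masked -> 'The Jacobi polynomial MATH_1 is an orthogonal polynomial.'
```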

Since Wikipedia articles are written in Wikitext, we use Sweble [99] to parse an article, replace MOI with placeholders, remove visual templates, and generate a plain text version of an article. Wikipedia officially recommends encoding in-line mathematics via templates that do not use LaTeX encoding (see Appendix B available in the electronic supplementary material for more details about math formulae in Wikipedia). In addition, since Wikipedia is community-driven, many mathematical expressions are not properly annotated as such. This makes it challenging to detect all MOI in a given document. For example, the Jacobi polynomial article<sup>16</sup> contains several formulae that use neither the math template nor the <math> tag (for LaTeX), such as the single identifier ''x'' and the UTF-8 character sequences < 0, [, {{pi}}-], and 0 ≤ *φ* ≤ 4{{pi}}. As an approach to detect such erroneous math, we consider sequences of symbols with specific Unicode properties as math. This includes the properties Sm for math symbols, Sk for symbol modifiers, Ps, Pe, Pd, and Po for several forms of punctuation and brackets, and Greek for Greek letters. In addition, single letters in italic, e.g., ''x'', are interpreted as math as well, an approach already successfully used by MLP. Via MLP, we also replace UTF-8 characters by their TeX equivalents. In the end, the erroneous UTF-8 encoded sequence 0 ≤ *φ* ≤ 4{{pi}} is replaced by 0 \leq \phi \leq 4\pi and considered as a single MOI. Using this approach, we detect 27 math tags, 11 math templates (including one numblk), and 13 in-line math expressions with erroneous annotations in the Jacobi polynomials article. The in-line math contains six single italic letters and seven complex sequences. In one case, the erroneous math was given in parentheses and the closing parenthesis was falsely identified as part of the math expression. Every other detection was correct.
In the future, more in-depth studies can be applied to improve the accuracy of in-line math detection in Wikitext [123, 377].
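The Unicode-property heuristic can be sketched with the standard `unicodedata` module. The category set mirrors the list above, while the token-level decision rule and function names are illustrative simplifications:

```python
import unicodedata

# Flag character sequences as math if they contain symbols with math-ish
# Unicode properties (Sm, Sk, Ps, Pe, Pd, Po) or Greek letters. This is a
# coarse sketch; punctuation-heavy prose would also be flagged, so the real
# pipeline applies further filtering on candidate spans in Wikitext.

MATH_CATEGORIES = {"Sm", "Sk", "Ps", "Pe", "Pd", "Po"}

def is_mathish_char(ch):
    if unicodedata.category(ch) in MATH_CATEGORIES:
        return True
    try:
        return "GREEK" in unicodedata.name(ch)
    except ValueError:  # unnamed character
        return False

def looks_like_math(token):
    chars = [c for c in token if not c.isspace()]
    return bool(chars) and any(is_mathish_char(c) for c in chars)

looks_like_math("0 ≤ φ ≤ 4π")    # -> True (≤ is Sm; φ and π are Greek)
looks_like_math("orthogonal")     # -> False
```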

### **4.2.5 Annotated Dependency Graph Construction**

Retrieving the one noun phrase (i.e., MC) that correctly describes a single MOI is most likely infeasible. Instead, we retrieve multiple noun phrases for each MOI and try to rank them accordingly. In the following, we construct a mathematical dependency graph for Wikipedia articles in order to retrieve as many relevant noun phrases for an MOI as possible. As we have discussed in an earlier project [214], there are multiple valid options to construct a dependency graph. We decided to use the POM tagger [402] to generate parse trees from LaTeX expressions

<sup>15</sup>For example, 'The Jacobi polynomial MATH\_1 is an orthogonal polynomial.' Both 'polynomial' tokens in this sentence are tagged as JJ (Adjective) with CoreNLP version 4.2.2.

<sup>16</sup>https://en.wikipedia.org/wiki/Jacobi\_polynomials [accessed 2021-06-07]

to build a dependency graph. The POM tagger lets us establish dependencies by comparing annotated, semantic parse trees. Since the POM tagger aims to disambiguate mathematical expressions in the future, the accuracy of our new dependency graph directly scales with the amount of semantic information available to the POM tagger. In addition, the more the POM tagger is able to disambiguate expressions, the more subexpressions *e*¯ ⊆ *e* ∈ L*<sub>P</sub>* are already in our target language, *e*¯ ∈ L*<sub>M</sub>*. Our translator LACAST also relies on the parse tree of the POM tagger [3, 13]. Technically, this allows us to feed LACAST directly with additional semantic information by manipulating the parse tree of the POM tagger. For example, consider the expression *a*(*b*+*c*). In general, LACAST would interpret the expression as a multiplication between *a* and (*b* + *c*), as most conversion tools would [18]. However, we can easily tag the first token *a* as a function in the parse tree and thereby change the translation accordingly without further programmatic changes. In the following, we only work on the parse tree of the POM tagger, which can be considered part of L*<sub>P</sub>*.

To establish dependencies between MOI, we introduce the concept of a mathematical stem (similar to 'word stems' in natural languages) that describes the static part of a function that does not change, e.g., the red tokens in Γ(*x*) or *P<sub>n</sub><sup>(α,β)</sup>*(*x*). Mathematical functions often have a unique identifier as part of the stem that represents the function, such as Γ(*x*) or *P<sub>n</sub><sup>(α,β)</sup>*(*x*). The identification of the stem of an MOI, however, is already context-dependent. As our introductory example of *π*(*x* + *y*) shows, the location of the stem depends on the identification of *π*(*x* + *y*) as the prime counting function. At this point in our pipeline, we lack sufficient semantic information about the MOI to identify the stem. On the other hand, a basic logic is necessary to avoid erroneous MOI dependencies. We apply the following heuristic for an MOI dependency: (i) at least one identifier must match in the same position in both MOI and (ii) this identifier is not enclosed in parentheses. Now, we replace every identifier in an MOI *m*<sub>1</sub> by a wildcard that matches a sequence of tokens or entire subtrees. If this pattern matches another MOI *m*<sub>2</sub> and the match obeys our heuristics (i) and (ii), we say *m*<sub>2</sub> depends on *m*<sub>1</sub> and define a directed edge from *m*<sub>1</sub> to *m*<sub>2</sub> in the graph. With the second heuristic, we avoid a dependency between Γ(*x*) and *π*(*x*) (since *x* fulfills the first heuristic but not the second). In the future, it would be worthwhile to study more heuristics on MOI to identify the stem via machine learning algorithms. A more comprehensive heuristic analysis is desirable, since not every function has a unique identifier in the stem, e.g., the Pochhammer symbol (*x*)*<sub>n</sub>*. Examples of dependencies between MOI can be found in Appendix F.2 available in the electronic supplementary material and on our demo page.
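Heuristic (ii) can be sketched as a parenthesis-depth scan: two MOI may only be connected if they share an identifier that sits outside all parentheses. Single-character identifiers on flat strings replace the actual wildcard matching on parse trees, so this is illustrative only:

```python
# Heuristics (i)+(ii) on flat strings: an MOI dependency requires a shared
# identifier that is not enclosed in parentheses. The real implementation
# matches wildcard patterns on POM parse trees; this sketch only checks
# single-character identifiers at parenthesis depth 0.

def outer_identifiers(moi):
    depth, idents = 0, set()
    for ch in moi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch.isalpha() and depth == 0:
            idents.add(ch)
    return idents

def may_depend(m1, m2):
    return bool(outer_identifiers(m1) & outer_identifiers(m2))

may_depend("Γ(x)", "π(x)")     # -> False: only x is shared, inside parens
may_depend("π(x)", "π(x+y)")   # -> True: π is shared outside parens
```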

In addition to the new concept for addressing math stems, we also changed our approach for definition detection. Previously [214], we presumed that every equation symbol declares a definition for the left-hand side expression. This would have a significant impact on the translation to CAS, since definitions must be translated differently compared to normal equations. Currently, there is no reliable approach available to distinguish an equation from a definition. Existing approaches try to classify entire textual sections of a document as definitions [111, 134, 183, 370] but not a single formula. We will elaborate more on this matter in Section 5.2.3. For now, we only consider an equation symbol as a definition if it is explicitly declared as such via :=.

For annotating MOI with textual descriptions, we first used a support vector machine [213] and later applied distance metrics [279, 329, 330] between single identifiers and textual descriptions. We were able to reach an F1 score of 0.36 for annotating single identifiers with textual descriptions. Since we are now working on more complex, less overloaded [14] MOI expressions, we can presume an improvement if we apply the same approach again. Hence, we used our latest improvements [330] and applied some changes to annotate MOI rather than single identifiers with textual descriptions from the surrounding context. Originally, we considered only nouns, noun sequences, adjectives followed by nouns, and Wikipedia links as candidates for definiens (now MC) [329]. However, in the field of OPSF, such descriptions are generally insufficient. Hence, we include connective possessive endings and prepositions between noun phrases (see Appendix F.1 available in the electronic supplementary material for further details).

Originally [329], we scored an identifier-definiens pair based on (1) the distance between the current identifier and its first occurrence in the document, (2) the distance (shortest path in the parse tree) between the definiens and the identifier, and (3) the distribution of the definiens in the sentence. We adopt this scoring technique for MOI and MC with slight adjustments. For condition (2), we declare the first noun in an MC as the representative token in the natural language parse tree. Therefore, (2) uses the shortest path between an MOI and the representative token in the parse tree. For condition (1), we need to identify the locations of MOI throughout an entire document. Our dependency graph allows us to track the location of an MOI in the document. Hence, (1) calculates the distance between an MOI and its first occurrence, isolated or as a dependent of another MOI, in the document. In addition, we set the score to 1 if a combination of MOI and noun phrases matches the patterns NP MOI or MOI (is|are) DT? NP. These basic patterns have proven to be very effective in previous experiments for extracting descriptions of mathematical expressions [213, 214, 279, 330]. We denote the final score of a fact *f*, i.e., of an MOI and MC pair, with s<sub>MLP</sub>(MOI*,* MC).
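A toy version of the pattern-based part of s<sub>MLP</sub> may look as follows; the two high-precision patterns follow the description above, while the distance-decay fallback formula is invented for illustration and is not the published weighting:

```python
import re

# Toy s_MLP: return 1.0 when one of the two high-precision patterns
# 'NP MOI' or 'MOI (is|are) DT? NP' matches; otherwise fall back to a
# token-distance decay. The fallback formula is illustrative only.

def s_mlp(sentence, moi_token, mc):
    np, m = re.escape(mc), re.escape(moi_token)
    if re.search(rf"{np}\s+{m}\b", sentence) or \
       re.search(rf"{m}\s+(is|are)\s+(a|an|the)?\s*{np}", sentence):
        return 1.0
    tokens = sentence.split()
    try:
        d = abs(tokens.index(moi_token) - tokens.index(mc.split()[0]))
    except ValueError:
        return 0.0
    return 1.0 / (1.0 + d)

s_mlp("MATH_1 is a polynomial", "MATH_1", "polynomial")                 # -> 1.0
s_mlp("MATH_1 denotes the gamma function", "MATH_1", "gamma function")  # -> 0.25
```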

### **4.2.6 Semantic Macro Replacement Patterns**

Now, we derive a rule *r<sub>f</sub>* for a fact *f* so that the MOI *m* ∈ L*<sub>P</sub>* can be replaced by a semantically enhanced version *m*′ ∈ L*<sub>C</sub>* of it. The main issue is that we are still unable to identify the stems of a formula. Suppose we have the MOI *P<sub>n</sub><sup>(α,β)</sup>*(*z*) identified as a *Jacobi polynomial*. How do we know the stem of a Jacobi polynomial and that *n*, *α*, *β*, and *z* are parameters and variables? For an appropriate translation, we even need to identify the right order of these arguments. There are two approaches: (i) we identify the definition of the formula in the article, or (ii) we look up a standard notation. The first approach works because, with the definition, we can deduce the stem of a function by identifying which identifiers of the function are reused in the definition. For example, in Figure 4.1, we see that *n*, *α*, *β*, and *z* appear in the definition of the Jacobi polynomial but *P* does not. Hence, we can conclude that the stem of the Jacobi polynomial must be *P<sub>n</sub><sup>(α,β)</sup>*(*x*). There are two remaining issues with this approach. First, what if a definition does not exist in the same article? This happens relatively often for OPSF, since OPSF are well established with more or less standard notation styles. Second, as previously pointed out, we cannot yet distinguish definitions from normal equations. As long as there is no reliable approach to identify definitions, approach (i) is infeasible. As a workaround, we focus on approach (ii) and leave (i) for future work.

In order to get standard notations and derive patterns from them, we use the semantic macros in the DLMF [260, 403]. A semantic macro is a semantically enhanced LaTeX expression that unambiguously describes the content of the expression. Hence, we can interpret a semantic macro as an unambiguous operator subtree *m*′ ∈ L*<sub>C</sub>*. The rendered version of the macro (i.e., the *normal* LaTeX version) is in a presentational format *m* ∈ L*<sub>P</sub>*. Hence, we can derive a fact-based rule *r<sub>f</sub>* = *m* → *m*′ by finding the appropriate semantic macro for a given mathematical description (the MC in a fact *f*).

Table 4.5: Mappings and likelihoods for the semantic LaTeX macro of the general hypergeometric function in the DLMF.

The DLMF defines more than 600 different semantic macros for OPSF. A single semantic macro may produce multiple rendered forms, e.g., by omitting the parentheses around the argument in sin *x*. This allows for fine control over the visualization of the formulae. Table 4.5 contains the four different versions of the general hypergeometric function (controlled by the number of @s). The last version (without variables and no @ symbol) is a special case that never appears in the DLMF. However, every semantic macro is also syntactically valid without arguments. Note also that not every version visualizes all information that is encoded in a semantic macro. For example, \genhyperF{2}{1}@@@{a,b}{c}{z} omits the variables *a*, *b*, and *c*. Table 4.5 also shows the LaTeX for each version of the macro. By replacing the arguments with wildcards, we generate a LaTeX pattern *m* that defines a rule *m* → *m*′. If the LaTeX omits information, we fill the missing slots of *m*′ with the default arguments denoted in the definitions of the semantic macros. For example, the default arguments for the general hypergeometric function are *p* and *q* for the parameters and *a*<sub>1</sub>*,...,a<sub>p</sub>*, *b*<sub>1</sub>*,...,b<sub>q</sub>*, and *z* for the variables. Hence, the last version in Table 4.5 fills up the slots for the variables with these default arguments (given in gray). In addition, the default arguments from the DLMF definitions also tell us if an argument can be a list, i.e., whether it may contain commas. Hence, we allow the two wildcards for the first two variables var1 and var2 to match sequences with commas, while the other wildcards are more restrictive and reject sequences with commas.

Since every semantic macro in the DLMF has a description, we can retrieve semantic macros, and thus the replacement rule *r<sub>f</sub>*, by using the annotations in the dependency graph as search queries. Currently, every fact has an MLP score s<sub>MLP</sub>(*f*). However, for each fact we may retrieve multiple replacement patterns, depending on how well the noun phrase (the MC) matches a semantic macro description in the DLMF. To solve this issue, we develop a cumulated ranking rk(*f*) for each fact. The first part of the ranking is the MLP score s<sub>MLP</sub>(*f*) that ranks the pair of MOI and description MC. Second, we index all DLMF replacement patterns in an Elasticsearch (ES)<sup>17</sup> database to search for a semantic macro for a given description. ES uses the BM25 score to retrieve relevant semantic macros for a given query. Hence, the second component of the ranking function is the ES score (normalized over all retrieved hits) for a retrieved semantic macro *m*′ and the given description MC: s<sub>ES</sub>(*f*). Lastly, every semantic macro *m*′ has multiple rendered forms, of which some are more frequently used than others in the DLMF, see the

<sup>17</sup>https://github.com/elastic/elasticsearch [accessed 2021-01-01]

probability in Table 4.5. Hence, we score a rule *r<sub>f</sub>* = *m* → *m*′ based on its likelihood of use in the DLMF. We counted the different versions of each semantic macro in the DLMF to calculate this likelihood. The last two replacement patterns in the table (the ones omitting information) never appear in the DLMF and have a probability of 0%. We denote this score as s<sub>DLMF</sub>(*r<sub>f</sub>*). The ranking rk(*f*) for a fact is simply the average over the three components s<sub>MLP</sub>(*f*), s<sub>ES</sub>(*f*), and s<sub>DLMF</sub>(*r<sub>f</sub>*).
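The cumulated ranking can be sketched as follows. The function names and example scores are illustrative, not LACAST's actual implementation; only the averaging scheme and the normalization of the ES scores follow the description above.

```python
def normalize(es_hits):
    """Normalize Elasticsearch BM25 scores over all retrieved hits."""
    total = sum(es_hits.values())
    return {m: s / total for m, s in es_hits.items()} if total else es_hits

def rank_fact(s_mlp, es_hits, macro, s_dlmf):
    """rk(f): average of the MLP score, the normalized ES score of the
    retrieved semantic macro, and the DLMF likelihood-of-use score."""
    s_es = normalize(es_hits)[macro]
    return (s_mlp + s_es + s_dlmf) / 3.0

# Hypothetical scores for a fact whose description retrieved two macros:
rk = rank_fact(0.8, {"\\genhyperF": 12.0, "\\hyperF": 4.0}, "\\genhyperF", 0.9)
```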

### **4.2.6.1 Common Knowledge Pattern Recognition**

Since LACAST was specifically developed for the semantics of the DLMF, it is not aware of general mathematical notation conventions. We fixed this issue by defining rules as part of the common knowledge set of facts K. We rank facts from K higher than facts from the article A in order to perform common knowledge pre-processing transformations prior to the facts derived from the article. Note that we do not presume that the following rules are always true. However, in the context of OPSF, we achieved better results by activating them by default and, if applicable, deactivating them for certain scenarios. These rules include: *π* is always interpreted as the constant; *e* is Euler's number if *e* is followed by a superscript (power) at least once in the expression; *i* is the imaginary unit if it does not appear in a subscript (index); and *γ* is the Euler-Mascheroni constant if the term *Mascheroni* or *Euler* exists in any *f* ∈ A. Note that these heuristics are consistent within an equation, i.e., *i* is never both an index and the imaginary unit within one equation. Further, we add rules for derivative notations, such as d*y*/d*x*, where *y* is optional and d can be followed by a superscript with a numeric value. In addition, LACAST presumes \diff{.} (e.g., for d*x*) after integrals, indicating the end of the argument of an integral. Hence, we search for *d* or d<sup>18</sup> followed by a letter after integrals and replace it with \diff{.} (see [11] for a more detailed discussion of this approach). Finally, a letter preceding a parenthesis is tagged as a function in the parse tree if the expression in parentheses contains commas or semicolons, or if it does not contain arithmetic symbols, such as + or −. Note that once a symbol is identified as a function following this rule, it is tagged as such everywhere, regardless of the local situation.
For example, in *f*(*x* + *π*) = *f*(*x*) we would tag *f* as a function even though the first part *f*(*x* + *π*) violates the mentioned rule. As previously mentioned, this changes the translation to Mathematica from f\*(x+Pi) to f[x+Pi]. We provide a detailed step-by-step example of the translation pipeline and an interactive demo at: https://tpami.wmflabs.org.
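A minimal sketch of these common-knowledge heuristics, assuming plain LaTeX strings as input; the regexes and the fact names are illustrative simplifications of LACAST's actual rule set.

```python
import re

def constant_heuristics(equation, context_words):
    """Decide whether pi, e, i, and gamma denote constants in a LaTeX
    equation, following the common-knowledge rules described above."""
    facts = {"pi_is_constant": True}  # pi is always the constant
    # e is Euler's number if e is followed by a superscript at least once
    facts["e_is_constant"] = re.search(r"e\s*\^", equation) is not None
    # i is the imaginary unit if it never appears inside a subscript
    facts["i_is_imaginary"] = re.search(r"_\{?[^}]*i", equation) is None
    # gamma is Euler-Mascheroni if the context mentions Euler or Mascheroni
    facts["gamma_is_euler_mascheroni"] = any(
        w in ("Euler", "Mascheroni") for w in context_words)
    return facts
```

Because the decision is made once per equation, a symbol such as *i* is tagged consistently, matching the rule that *i* is never both an index and the imaginary unit within one equation.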

This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

<sup>18</sup>Note the difference between the normal italic *d* and the roman typestyle d.

*It is possible to commit no mistakes and still lose. That is not a weakness, that is life.*

Jean-Luc Picard - *Star Trek: The Next Generation*

### **CHAPTER 5**

### **Qualitative and Quantitative Evaluations**


This chapter primarily contributes to research task **V**, i.e., evaluating the effectiveness of the semantification and translation system LACAST. In Section 5.1, we also extend LACAST's semantic LATEX translations to support more mathematical operators, including sums, products, integrals, and limit notations. Hence, this chapter secondarily also contributes to research task **IV**, i.e.,

**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-40473-4\_5.

© The Author(s) 2023 A. Greiner-Petter, *Making Presentation Math Computable*, https://doi.org/10.1007/978-3-658-40473-4\_5

implementing an extension of the semantification approach to provide translations to CAS. We evaluate LACAST on two different datasets: the DLMF and Wikipedia.

First, we evaluate LACAST on the DLMF to estimate the capabilities and limitations of our rule-based translator on a semantically enhanced dataset. Translating formulae from the DLMF to CAS can be considered simpler, primarily for three reasons. First, the formulae are manually enhanced and can be considered unambiguous in most cases. Second, the constraints of formulae are directly attached to equations and therefore accessible to LACAST. Lastly, parts of equations in the DLMF are linked to their definitions, which allows resolving substitutions and fetching additional constraints. In Wikipedia articles, this meta information is either not available or given only in the surrounding context, which greatly harms the accessibility of this crucial data. Hence, we presume that we achieve the best possible translations via LACAST on the DLMF. To evaluate the capabilities of LACAST, we perform numeric and symbolic evaluation techniques on the translations [3, 13]. We will further use these evaluation approaches to identify flaws in the DLMF and CAS computations.

Next, we evaluate LACAST on Wikipedia as the direct successor of the previous Chapter 4. Here, we use the full and final version of LACAST, including every improvement that has been discussed throughout the thesis. Specifically, it actively uses all common knowledge pattern recognition techniques discussed in Section 4.2.6.1, all heuristics for detecting math operators introduced in Section 5.1.2, and the enhanced symbolic and numeric evaluation pipeline first outlined in [3] and finally elaborated in Section 5.1.3. In combination with the automatic evaluation, we are able to perform plausibility checks on complex mathematical formulae in Wikipedia.

This chapter is split into two parts, following the two main motivations behind them. In Section 5.1, we elaborate on the possibility of using LACAST translations to automatically verify entire DML and CAS with one another. We specifically focus on the DLMF for our DML, and on Mathematica and Maple for our general-purpose CAS. In Section 5.2, we use the final context-sensitive version of LACAST introduced in Chapter 4, including every improvement introduced in Section 5.1 of this chapter, with the goal of verifying equations in Wikipedia articles. This chapter finalizes the improvements of LACAST for semantic LATEX expressions (Section 5.1) and general LATEX expressions (Section 5.2).

The content of Section 5.1 was published at the TACAS conference [8]. Some parts in Section 5.2 have also been previously published at the CICM conference [2]. Section 5.2, as the direct successor of Chapter 4, is part of the aforementioned submission to the TPAMI journal [11].

### **5.1 Evaluations on the Digital Library of Mathematical Functions**

Digital Mathematical Libraries (DML) gather the knowledge and results from thousands of years of mathematical research. Even though pure and applied mathematics are precise disciplines, gathering their knowledge bases over many years results in issues that every digital library shares: consistency, completeness, and accuracy. Likewise, CAS<sup>1</sup> play a crucial role in the modern era for pure and applied mathematics and those fields which rely on them. CAS can be used to simplify, manipulate, compute, and visualize mathematical expressions. Accordingly,

<sup>1</sup> In the sequel, the acronyms CAS and DML are used, depending on the context, interchangeably with their plurals.

modern research regularly uses DML and CAS together. Nonetheless, DML [2, 10] and CAS [20, 100, 180] are not exempt from having bugs or errors. Durán et al. [100] even raised the rather dramatic question: "*can we trust in [CAS]*?"

Existing comprehensive DML, such as the DLMF [98], are consistently updated and frequently corrected with errata<sup>2</sup>. Although each chapter of the DLMF and its print analog *The NIST Handbook of Mathematical Functions* [276] has been carefully written, edited, validated, and proofread over many years, errors still remain. Maintaining a DML, such as the DLMF, is a laborious process. Likewise, CAS are eminently complex systems and, in the case of commercial products, often similar to black boxes in which the magic (i.e., the computations) happens in opaque private code [100]. CAS, especially commercial products, are often exclusively tested internally during development.

An independent examination process can improve testing and increase trust in the systems and libraries. Hence, we want to elaborate on the following research question.

### **Research Question**

How can digital mathematical libraries and computer algebra systems be utilized to improve and verify one another?

Our initial approach for answering this question is inspired by Cohl et al. [2]. In order to verify a translation tool from a specific LATEX dialect to Maple, they performed symbolic and numeric evaluations on equations from the DLMF. This approach presumes that a proven equation in a DML must also be valid in a CAS. In turn, a disparity between the DML and the CAS would point to an issue in the translation process. However, assuming a correct translation, a disparity would instead indicate an issue either in the DML source or in the CAS implementation. Hence, we can take advantage of the same approach proposed by Cohl et al. [2] to improve and even verify DML with CAS and vice versa. Unfortunately, previous efforts to translate mathematical expressions from various formats, such as LATEX [3, 10], MathML [18], or OpenMath [152], to CAS syntax show that the translation will be the most critical part of this verification approach.

In this section, we elaborate on the feasibility and limitations of the translation approach from DML to CAS as a possible answer to our research question. For this first study, we focus on the DLMF as our DML and on the two general-purpose CAS Maple and Mathematica. This relatively sharp limitation is necessary in order to analyze the capabilities of the underlying approach to verify commercial CAS and large DML. The DLMF internally uses semantic macros in order to disambiguate mathematical expressions [260, 403]. These macros help to mitigate the open issue of retrieving sufficient semantic information from a context to perform translations to formal languages [10, 18]. Further, the DLMF and general-purpose CAS have a relatively large overlap in their coverage of special functions and orthogonal polynomials. Since many of these functions play a crucial role in a large variety of different research fields, we focus in this first study mainly on these functions.

In particular, we extend the first version of LACAST [3] to significantly increase the number of translatable functions in the DLMF. Current extensions include a new handling of constraints; support for the mathematical operators sum, product, limit, and integral; as well as overcoming

<sup>2</sup> https://dlmf.nist.gov/errata/ [accessed 2021-05-01]

semantic hurdles associated with Lagrange (prime) notations commonly used for differentiation. Further, we extend its support to include Mathematica using the freely available WED<sup>3</sup> (hereafter, with Mathematica, we refer to the WED). These improvements allow us to cover a larger portion of the DLMF, increase the reliability of the translations via LACAST, and allow for comparisons between two major general-purpose CAS, namely Maple and Mathematica, for the first time. Finally, we provide open access to all the results contained within this paper<sup>4</sup>.

The section is structured as follows. Section 5.1.1 explains the data in the DLMF. Section 5.1.2 focuses on the improvements to LACAST that were made to render the translation as comprehensive and reliable as possible for the upcoming evaluation. Section 5.1.3 explains the symbolic and numeric evaluation pipeline and provides an in-depth discussion of that process. Subsequently, we analyze the results in Section 5.1.4. Finally, we conclude the findings and provide an outlook on upcoming projects in Section 5.1.5.

**Related Work** Existing verification techniques for CAS often focus on specific subroutines or functions [45, 58, 107, 148, 180, 185, 225, 228], such as specific theorems [218], differential equations [153], or the implementation of the math.h library [224]. Most common are verification approaches that rely on intermediate verification languages [45, 148, 153, 180, 185], such as *Boogie* [29, 225] or *Why3* [41, 185], which, in turn, rely on proof assistants and theorem provers, such as *Coq* [37, 45], *Isabelle* [153, 167], or *HOL Light* [146, 148, 180]. Kaliszyk and Wiedijk [180] proposed an entirely new CAS built on top of the proof assistant HOL Light so that each simplification step can be proven by the underlying architecture. Lewis and Wester [228] manually compared symbolic computations on polynomials and matrices across seven CAS. Aguirregabiria et al. [20] suggested teaching students the known traps and difficulties with evaluations in CAS in order to reduce the overreliance on computational solutions.

We [2] developed the aforementioned translation tool LACAST, which translates expressions from a semantically enhanced LATEX dialect to Maple. By evaluating the performance and accuracy of the translations, we were able to discover a sign error in one of the DLMF's equations [2]. While the evaluation was not intended to verify the DLMF, the translations by the rule-based translator LACAST provided sufficient robustness to identify issues in the underlying library. To the best of our knowledge, besides this related evaluation via LACAST, there are no existing libraries or tools that allow for automatic verification of DML.

### **5.1.1 The DLMF dataset**

In the modern era, most mathematical texts (handbooks, journal publications, magazines, monographs, treatises, proceedings, etc.) are written using the document preparation system LATEX. However, LATEX focuses on precise control of the rendering mechanics rather than on a semantic description of its content. In contrast, CAS syntax is coercively unambiguous in order to interpret the input correctly. Hence, a transformation tool from DML to CAS must disambiguate mathematical expressions. While there is an ongoing effort towards such a process [14, 214, 329, 402, 408], no reliable tool is available to date that disambiguates mathematics sufficiently.

<sup>3</sup> https://www.wolfram.com/engine/ [accessed 2021-05-01]

<sup>4</sup> https://lacast.wmflabs.org/ [accessed 2021-10-01]

The DLMF contains numerous relations between functions and many other properties. It is written in LATEX but uses specific semantic macros when applicable [403]. These semantic macros represent a unique function or polynomial defined in the DLMF. Hence, the semantic LATEX used in the DLMF is often unambiguous. For a successful evaluation via CAS, we also need to utilize all requirements of an equation, such as constraints, domains, or substitutions. The DLMF provides this additional data too, generally in a machine-readable form [403]. This data is accessible via the i-boxes (information boxes attached to an equation). If the information is not given in the attached i-box, or the information is incorrect, the translation via LACAST fails. The i-boxes, however, do not contain information about branch cuts (see Section 5.1.4.1) or constraints. Constraints are accessible if they are directly attached to an equation. If they appear in the text (or even a title), LACAST cannot utilize them. The test dataset we used was generated from DLMF version 1.0.28 (2020-09-15) and contained 9,977 formulae with 1,505 defined symbols, 50,590 used symbols, 2,691 constraints, and 2,443 warnings for non-semantic expressions, i.e., expressions without semantic macros [403]. Note that the DLMF does not provide access to the underlying LATEX source. Therefore, we added the source of every equation to our result dataset.

### **5.1.2 Semantic LaTeX to CAS translation**

The aforementioned translator LACAST was first developed by Greiner-Petter et al. [3, 10]. They reported a coverage of 53.6% successful translations [3] for a manually selected part of the DLMF to the CAS Maple. Afterward, they extended LACAST to perform symbolic and numeric evaluations on the entire DLMF and reported a coverage of 58.8% translations [2]. This version of LACAST serves as the baseline for our improvements. They reported a success rate of ∼16% for symbolic and ∼12% for numeric verifications.

Evaluating the baseline on the entire DLMF results in a coverage of only 31.6%. Hence, we first want to increase the coverage of LACAST on the DLMF. To achieve this goal, we first increase the number of translatable semantic macros by manually defining more translation patterns for special functions and orthogonal polynomials. For Maple, we increased the number from 201 to 261. For Mathematica, we defined 279 new translation patterns, which enables LACAST to perform translations to Mathematica. Even though the DLMF uses 675 distinct semantic macros, we cover ∼70% of all DLMF equations with our extended list of translation patterns (see Zipf's law for mathematical notations [14]). In addition, we implemented rules for translations that are applicable in the context of the DLMF, e.g., ignoring an ellipsis following floating-point values, or that \choose always refers to a binomial expression. Finally, we tackle the remaining issues outlined by Cohl et al. [2], which can be categorized into three groups: (i) expressions in which the arguments of operators are not clear, namely sums, products, integrals, and limits; (ii) expressions with prime symbols indicating differentiation; and (iii) expressions that contain an ellipsis. While we solve some of the cases in group (iii) by ignoring an ellipsis following floating-point values, most of these cases remain unresolved.

In the following, we first introduce the constraint handling via blueprints<sup>5</sup>. Next, we elaborate on our solutions for (i) in Section 5.1.2.2 and for (ii) in Section 5.1.2.3.

<sup>5</sup> This subsection 5.1.2.1 was previously published by Cohl et al. [2].

### **5.1.2.1 Constraint Handling**

Correct assumptions about variable domains are essential for CAS and, not surprisingly, lead to significant improvements in the CAS's ability to simplify expressions. The DLMF provides constraint (variable domain) metadata for formulae, and we have extracted this metadata. We have incorporated these constraints as assumptions for the simplification process (see Section 5.1.3.1). Note, however, that a direct translation of the constraint metadata is usually not sufficient for a CAS to be able to understand it. Furthermore, testing invalid values in numerical tests returns incorrect results (see Section 5.1.3.2).

For instance, different symbols must be interpreted differently depending on their usage, and one must be able to interpret certain notations of this kind correctly. For instance, one must be able to interpret the command a,b\in A, which indicates that both variables a and b are elements of the set A (or, more generally, a\_1,\dots,a\_n\in A). Similar conventions are often used for variables that are elements of other sets, such as the sets of rational, real, or complex numbers, or subsets of those sets.

Also, one must be able to interpret constraints with variables in sets defined using an equals notation, such as n=0,1,2,\dots, which indicates that the variable n is an integer greater than or equal to zero, or, together, n,m=0,1,2,\dots, which indicates that both variables n and m are elements of this set. Since mathematicians who write LATEX are often casual about expressions such as these, one should know that 0,1,2,\dots is the same as 0,1,\dots. Consistently, one must also be able to correctly interpret infinite sets (represented as strings) such as =1,2,\dots, =1,2,3,\dots, =-1,0,1,2,\dots, =0,2,4,\dots, or even =3,7,11,\dots, or =5,9,13,\dots. One must equally be able to interpret finite sets such as =1,2, =1,2,3, or =1,2,\dots,N.
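A sketch of how such set strings could be normalized into an arithmetic progression, assuming they arrive as raw LaTeX fragments; the function and its return convention (start, step, and whether the set is infinite) are illustrative, not LACAST's internal representation.

```python
def parse_integer_set(s):
    """Infer (start, step, infinite) from strings like '=1,2,\\dots' or
    '=0,2,4,\\dots'. Following the convention above, '0,1,2,\\dots' and
    '0,1,\\dots' are treated identically (step 1)."""
    parts = [p.strip() for p in s.lstrip("=").split(",")]
    infinite = parts[-1] in (r"\dots", "...")
    nums = [int(p) for p in parts if p.lstrip("-").isdigit()]
    start = nums[0]
    step = nums[1] - nums[0] if len(nums) > 1 else 1
    return start, step, infinite
```

For example, `parse_integer_set(r"=3,7,11,\dots")` recognizes the progression starting at 3 with step 4.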

An entire language of mathematical notation must be understood in order for a CAS to be able to understand constraints. In mathematics, the syntax of constraints is often very compact and may contain textual explanations. Translating constraints from LATEX to CAS is a delicate task because CAS only allow precise and strict syntax formats. For example, the typical constraint 0 < *x* < 1 is invalid if directly translated to Maple, because it would need to be translated into two separate constraints, namely *x* > 0 and *x* < 1.

We have improved the handling and translation of variable constraints/assumptions for simplification and numerical evaluation. Adding assumptions about the constrained variables improves the effectiveness of Maple's simplify function. Our previous approach to constraint handling for numerical tests was to extract a pre-defined set of test values and to filter invalid values according to the constraints. Because of this strategy, there often were no valid values remaining after the filtering. To overcome this issue, we instead chose a single numerical value for a variable that appears in a pre-defined constraint. For example, if a test case contains the constraint 0 < *x* < 1, we chose *x* = 1/2.

A naive approach for this strategy is to apply regular expressions to identify a match between a constraint and a rule. However, we believed that this approach does not scale well with more and more pre-defined rules and increasingly complex constraints. Hence, we used the POM-tagger to create blueprints of the parse trees for pre-defined rules. For the example LATEX constraint \$0 < x < 1\$, rendered as 0 < *x* < 1, our textual rule is given by

0 < var < 1 ==> 1/2.

The parse tree for this blueprint constraint contains five tokens, where var is an alphanumerical token that is considered to be a placeholder for a variable.

We can also distinguish multiple variables by adding an index to the placeholder. For example, the rule we generated for the mathematical LATEX constraint \$x,y \in \Real\$, where \Real is the semantic macro which represents the set of real numbers, rendered as *x, y* ∈ ℝ, is given by

var1, var2 \in \Real ==> 3/2, 3/2.

A constraint will match one of the blueprints if the number, the ordering, and the type of the tokens are equal. Allowed matching tokens for the variable placeholders are Latin or Greek letters and alphanumerical tokens.
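The matching can be sketched as follows; the tokenizer and the rule table are simplified stand-ins for the POM-tagger parse trees that LACAST actually compares, but the matching criteria (token count, order, and type) follow the description above.

```python
import re

# Illustrative blueprint rules in the textual form shown above.
RULES = [
    ("0 < var < 1", ["1/2"]),
    (r"var1 , var2 \in \Real", ["3/2", "3/2"]),
]

# Tokens: LaTeX macros, alphanumeric words, or single symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z][A-Za-z0-9]*|\S")

def match_blueprint(constraint):
    """Match a LaTeX constraint against the blueprints: the number, order,
    and type of tokens must agree; var placeholders bind alphanumeric
    tokens, while all other tokens must match literally."""
    ctoks = TOKEN.findall(constraint)
    for pattern, values in RULES:
        ptoks = TOKEN.findall(pattern)
        if len(ptoks) != len(ctoks):
            continue
        binding, ok = {}, True
        for p, c in zip(ptoks, ctoks):
            if p.startswith("var"):
                if c.isalnum():
                    binding[p] = c      # placeholder binds the variable
                else:
                    ok = False
                    break
            elif p != c:                # literal tokens must be identical
                ok = False
                break
        if ok:
            return binding, values
    return None
```

A successful match returns both the variable binding and the pre-defined test values, e.g., `match_blueprint("0 < x < 1")` binds `x` and yields `1/2`.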

### **5.1.2.2 Parse sums, products, integrals, and limits**

Here we consider common notations for the sum, product, integral, and limit operators. For these operators, one may consider Mathematically Essential Operator Metadata (MEOM). For all these operators, the MEOM includes *argument(s)* and *bound variable(s)*. The operators act on the arguments, which are themselves functions of the bound variable(s). For sums and products, the bound variables are referred to as *indices*. The bound variables for integrals<sup>6</sup> are called *integration variables*. For limits, the bound variables are continuous variables (for limits of continuous functions) and indices (for limits of sequences). For integrals, the MEOM includes precise descriptions of regions of integration (e.g., piecewise continuous paths/intervals/regions). For limits, the MEOM includes limit points (e.g., points in ℝ<sup>*n*</sup> or ℂ<sup>*n*</sup> for *n* ∈ ℕ), as well as information on whether the limit at the limit point is independent of or dependent on the direction in which the limit is taken (e.g., one-sided limits).

For a translation of mathematical expressions involving the LATEX commands \sum, \int, \prod, and \lim, we must extract the MEOM. This is achieved by (a) determining the argument of the operator and (b) parsing corresponding subscripts, superscripts, and arguments. For integrals, the MEOM may be complicated, but it certainly contains the argument (the function which will be integrated), the bound (integration) variable(s), and details related to the region of integration. Bound variable extraction is usually straightforward since the variable is usually contained within a differential expression (infinitesimal, pushforward, differential 1-form, exterior derivative, measure, etc.), e.g., d*x*. Argument extraction is less straightforward: even though differential expressions are often given at the end of the argument, sometimes the differential expression appears in the numerator of a fraction, e.g., *f*(*x*)d*x*/*g*(*x*). In that case, the argument is everything to the right of the \int (neglecting its subscripts and superscripts) up to and including the fraction involving the differential expression (which may be replaced with 1). In cases where the differential expression is fully to the right of the argument, it is a *termination symbol*. Note that some scientists use an alternate notation for integrals where the differential expression appears immediately to the right of the integral sign, e.g., ∫ d*x* *f*(*x*). However, this notation does not appear in the DLMF. If such notations are encountered, we follow the same approach that we used for sums, products, and limits (see Section 5.1.2.2).

<sup>6</sup> The notion of integrals includes: antiderivatives (indefinite integrals), definite integrals, contour integrals, multiple (surface, volume, etc.) integrals, Riemannian volume integrals, Riemann integrals, Lebesgue integrals, Cauchy principal value integrals, etc.

**Extraction of variables and corresponding MEOM** The subscripts and superscripts of sums, products, limits, and integrals may differ between notations and are therefore challenging to parse. For integrals, we extract the bound (integration) variable from the differential expression. For sums and products, the upper and lower bounds may appear in the subscript or superscript. Parsing subscripts is comparable to the problem of parsing constraints [2] (which are often not consistently formulated). We overcame this complexity by manually defining patterns of common constraints, which we refer to as blueprints (see Section 5.1.2.1). This blueprint pattern approach allows LACAST to identify the MEOM in the sub- and superscripts.

For our MEOM blueprints, we define three placeholders: varN for single identifiers or a list of identifiers (delimited by commas), and numL1 and numU1, representing lower and upper bound expressions, respectively. In addition, for sums and products, we need to distinguish between including and excluding boundaries, e.g., 1 < *k* and 1 ≤ *k*. An excluding relation, such as 0 < *k* < 10, must be interpreted as a sum from 1 to 9. Table 5.1 shows the final set of sum/product subscript blueprints.
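For integer bounds, the including/excluding distinction amounts to shifting the matched bound by one; a minimal sketch (the function name is ours, not LACAST's):

```python
def inclusive_bounds(lower, upper, lower_strict, upper_strict):
    """Turn a matched subscript such as 0 < k < 10 into the inclusive
    summation bounds (1, 9); a non-strict relation keeps the bound."""
    lo = lower + 1 if lower_strict else lower
    hi = upper - 1 if upper_strict else upper
    return lo, hi
```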

Standard notations may not explicitly show infinity boundaries. Hence, we set the default boundaries to infinity. For limit expressions we need different blueprints to capture the limit direction. We cover the standard notations with 'var1 \to numL\*', where \* is either +, -, ^+, ^-, or absent, and the different arrow notations where \to can be either \downarrow, \uparrow, \searrow, or \nearrow, specifying one-sided limits. Note that the arrow notation (besides \to) is not used in the DLMF and thus has no effect on the performance of LACAST in our evaluation. Note further that, while the blueprint approach is very flexible, it cannot handle every possible scenario, such as the divisor sum ∑<sub>(*p*−1)|2*n*</sub> 1/*p* [98, (24.10.1)]. Proper translations of such complex cases may even require symbolic manipulation, which is currently beyond the capabilities of LACAST.


Table 5.1: The table contains examples of the blueprints for subscripts of sums/products including an example expression that matches the blueprint.

**Identification of operator arguments** Once we have extracted the bound variable for sums, products, and limits, we need to determine the end of the argument. We analyzed all sums in the DLMF and developed a heuristic that covers all the formulae in the DLMF and potentially a large portion of general mathematics. Let *x* be the extracted bound variable. For sums, we consider a summand part of the argument if (I) it is the very first summand after the operator; or (II) *x* is an element of the current summand; or (III) *x* is an element of a following summand (subsequent to the current summand) and there is no termination symbol between the current summand and the summand which contains *x* with an equal or lower depth according to the parse tree (i.e., closer to the root). We consider a summand a single logical construct since addition and subtraction are granted a lower operator precedence than multiplication in mathematical expressions. Similarly, parentheses are granted higher precedence, and thus a sequence wrapped in parentheses is part of the argument if it obeys rules (I)-(III). Summands, and such sequences, are always either entirely part of sums, products, and limits or entirely not.

A termination symbol always marks the end of the argument list. Termination symbols are relation symbols, e.g., =, ≠, ≤, or >; closing parentheses or brackets, e.g., ) or ]; and other operators with MEOM if, and only if, they define the same bound variable. If *x* is part of a subsequent operation, then the following operator is considered part of the argument (as in (II)). However, a termination symbol only terminates the current chain of arguments. Consider a sum over a fraction of sums. In that case, we may reach a termination symbol within the fraction. However, this termination symbol would be deeper inside the parse tree as compared to the current list of arguments. Hence, we use the depth to determine whether a termination symbol should be recognized or not. Consider an unusual notation with the binomial coefficient as an example

$$\sum\_{k=0}^{n} \binom{n}{k} = \sum\_{k=0}^{n} \frac{\prod\_{m=1}^{n} m}{\prod\_{m=1}^{k} m \prod\_{m=1}^{n-k} m} \,. \tag{5.1}$$

This equation contains two termination symbols, marked red and green. The red termination symbol = is obviously for the first sum on the left-hand side of the equation. The green termination symbol ∏ terminates the product to its left, because both products run over the same bound variable *m*. In addition, none of the other = signs are termination symbols for the sum on the right-hand side of the equation because they are deeper in the parse tree and thus do not terminate the sum.
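Rules (I)-(III) and the depth test for termination symbols can be sketched over a flattened parse tree as follows; the data layout (identifier sets paired with parse-tree depths) is an illustrative simplification of LACAST's tree traversal.

```python
TERMINATION = {"=", "\\neq", "\\leq", ")", "]"}

def extract_argument(summands, x, op_depth=0):
    """Collect the indices of the summands forming the argument of a sum
    over bound variable x. Each summand is (identifiers, depth); a
    termination symbol only counts at a depth not below the operator's."""
    argument = []
    for i, (idents, depth) in enumerate(summands):
        if TERMINATION & idents and depth <= op_depth:
            break                                  # end of the argument list
        if i == 0 or x in idents:                  # rules (I) and (II)
            argument.append(i)
        elif any(x in later for later, d in summands[i + 1:] if d <= depth):
            argument.append(i)                     # rule (III), simplified
        else:
            break
    return argument
```

For instance, for a sum over *n* whose summands contain {c}, {a}, and {n}, the middle summand is kept by rule (III) because *n* appears later with no intervening termination symbol.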

Note that varN in the blueprints also matches multiple bound variables, e.g., *m, k* ∈ *A*. In such cases, *x* from above is a list of bound variables, and a summand is part of the argument if one of the elements of *x* is within this summand. During the translation, the operation is split into two consecutive operations, i.e., ∑<sub>*m,k*∈*A*</sub> becomes ∑<sub>*m*∈*A*</sub> ∑<sub>*k*∈*A*</sub>. Figure 5.1 shows the extracted arguments for some example sums. The same rules apply for the extraction of arguments for products and limits.


Figure 5.1: Example argument identifications for sums.

### **5.1.2.3 Lagrange's notation for differentiation and derivatives**

Another remaining issue is the Lagrange (prime) notation for differentiation, since it does not outwardly provide sufficient semantic information. This notation poses two challenges. First, we do not know with respect to which variable the differentiation should be performed. Consider, for example, the Hurwitz zeta function $\zeta(s, a)$ [98, §25.11]. In the case of a differentiation $\zeta'(s, a)$, it is not clear whether the function should be differentiated with respect to $s$ or $a$. To remedy this issue, we analyzed all formulae in the DLMF that use prime notations and determined which variables (slots) of which functions represent the variables of differentiation. Based on our analysis, we extended the translation patterns with meta information for semantic macros according to the slot of differentiation. For instance, in the case of the Hurwitz zeta function, the first slot is the slot for prime differentiation, i.e., $\zeta'(s, a) = \frac{\mathrm{d}}{\mathrm{d}s} \zeta(s, a)$. The identified variables of differentiation for the special functions in the DLMF can be considered the standard slots of differentiation, e.g., in other DML, $\zeta'(s, a)$ most likely refers to $\frac{\mathrm{d}}{\mathrm{d}s} \zeta(s, a)$.

The second challenge occurs if the slot of differentiation contains a complex expression rather than a single symbol, e.g., $\zeta'(s^2, a)$. In this case, $\zeta'(s^2, a) = \frac{\mathrm{d}}{\mathrm{d}(s^2)} \zeta(s^2, a)$ rather than $\frac{\mathrm{d}}{\mathrm{d}s} \zeta(s^2, a)$. Since CAS often do not support derivatives with respect to complex expressions, we use the inbuilt substitution functions<sup>7</sup> of the CAS to overcome this issue. To do so, we use a temporary variable temp for the substitution. CAS perform substitutions from the inside out. Hence, we can use the same temporary variable temp even for nested substitutions. Table 5.2 shows the translation performed for $\zeta'(s^2, a)$. CAS may provide optional arguments to calculate the derivatives of certain special functions, e.g., Zeta(n,z,a) in Maple for the $n$-th derivative of the Hurwitz zeta function. However, this shorthand notation is generally not supported (e.g., Mathematica does not define such an optional parameter). Our substitution approach is lengthier but more reliable. Unfortunately, lengthy expressions generally harm the performance of CAS, especially for symbolic manipulations. Hence, we have a genuine interest in keeping translations short, straightforward, and readable. Thus, the substitution translation pattern is only triggered if the variable of differentiation is not a single identifier. Note that this substitution only triggers on semantic macros. Generic functions, including prime notations, are still skipped.
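The rule can be illustrated with a small string-building sketch. The function below is hypothetical (LaCASt's actual translation patterns are more elaborate and CAS-specific); it only shows that a plain diff call suffices for a single identifier, while a complex slot triggers the subs-based translation with the temporary variable temp:

```python
def prime_translation(args, slot):
    """Build a Maple-style translation for a primed Hurwitz zeta macro.

    If the differentiation slot holds a single identifier, differentiate
    directly; otherwise substitute the temporary variable temp,
    differentiate with respect to it, and substitute the slot expression
    back via subs (the pattern described in the text)."""
    inner = args[slot]
    if inner.isidentifier():  # single symbol, e.g. 's'
        return f"diff(Zeta({', '.join(args)}), {inner})"
    tmp_args = ["temp" if i == slot else a for i, a in enumerate(args)]
    return f"subs(temp = {inner}, diff(Zeta({', '.join(tmp_args)}), temp))"
```

Here `prime_translation(["s", "a"], 0)` yields `diff(Zeta(s, a), s)`, while `prime_translation(["s^2", "a"], 0)` yields `subs(temp = s^2, diff(Zeta(temp, a), temp))`.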


Table 5.2: Example translations for the prime derivative of the Hurwitz zeta function with respect to $s^2$.

A problem related to the MEOMs of sums, products, integrals, limits, and differentiations are the notations of derivatives. The semantic macro for derivatives \deriv{w}{x} (rendered as $\frac{\mathrm{d}w}{\mathrm{d}x}$) is

<sup>7</sup> Note that Maple also supports an evaluation substitution via the two-argument eval function. Since our substitution only triggers on semantic macros, we only use subs if the function is defined in Maple. As far as we know, there is no practical difference between subs and the two-argument eval in our case.

often used with an empty first argument to render the function behind the derivative notation, e.g., \deriv{}{x}\sin@{x} for $\frac{\mathrm{d}}{\mathrm{d}x} \sin x$. This leads to the same problem we faced above when identifying MEOMs. In this case, we use the same heuristic as for sums, products, and limits. Note that derivatives may be written following the function argument, e.g., $\sin(x) \frac{\mathrm{d}}{\mathrm{d}x}$. If we are unable to identify any following summand that contains the variable of differentiation before we reach a termination symbol, we look for arguments prior to the derivative according to the heuristics (I-III).

**Wronskians** With the support for prime differentiation described above, we are also able to translate the Wronskian [98, (1.13.4)] to Maple and Mathematica. A translation requires identifying the variable of differentiation from the elements of the Wronskian, e.g., $z$ for $\mathscr{W}\{\mathrm{Ai}(z), \mathrm{Bi}(z)\}$ from [98, (9.2.7)]. We analyzed all Wronskians in the DLMF and discovered that most Wronskians have a special function in their arguments, such as the example above. Hence, we can use our previously inserted metadata about the slots of differentiation to extract the variable of differentiation from the semantic macros. If the semantic macro argument is a complex expression, we search for the identifier that appears in the arguments of both elements of the Wronskian. For example, in $\mathscr{W}\{\mathrm{Ai}(z^a), \zeta(z^2, a)\}$, we extract $z$ as the variable since it is the only identifier that appears in both arguments $z^a$ and $z^2$ of the elements. This approach is also used when no semantic macro is involved, i.e., from $\mathscr{W}\{z^a, z^2\}$ we extract $z$ as well. If LACAST extracts multiple candidates or none, it throws a translation exception.
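The heuristic amounts to intersecting the identifier sets of the two elements' arguments. A toy sketch, restricted to single-letter identifiers (which covers the examples above); the function name and string representation are illustrative assumptions:

```python
import re

def wronskian_variable(arg1, arg2):
    """Return the unique identifier occurring in the arguments of both
    Wronskian elements; raise an error if the candidate is not unique,
    analogous to the translation exception thrown by LaCASt."""
    common = set(re.findall(r"[a-zA-Z]", arg1)) & set(re.findall(r"[a-zA-Z]", arg2))
    if len(common) != 1:
        raise ValueError(f"ambiguous variable of differentiation: {sorted(common)}")
    return common.pop()
```

For the arguments "z^a" and "z^2" of the example above, the only shared identifier is z; for "z*a" and "a*z", both z and a are shared and an exception is raised.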

### **5.1.3 Evaluation of the DLMF using CAS**

Figure 5.2: The workflow of the evaluation engine and the overall results. Errors and abortions are not included. The generated dataset contains 9,977 equations. In total, the case analyzer splits the data into 10,930 cases, of which 4,307 cases were filtered. This sums up to a set of 6,623 test cases in total.

For evaluating the DLMF with Maple and Mathematica, we symbolically and numerically verify the equations in the DLMF with CAS. If a verification fails both symbolically and numerically, we have identified an issue either in the DLMF, the CAS, or the verification pipeline. Note that an issue does not necessarily represent an error/bug in the DLMF, CAS, or LACAST (see the discussion about branch cuts in Section 5.1.4.1). Figure 5.2 illustrates the pipeline of the evaluation engine. First, we analyze every equation in the DLMF (hereafter referred to as test cases). A case analyzer splits multiple relations in a single line into multiple test cases. Note that only adjacent relations are considered, i.e., for $f(z) = g(z) = h(z)$, we generate the two test cases $f(z) = g(z)$ and $g(z) = h(z)$ but not $f(z) = h(z)$. In addition, expressions with $\pm$ and $\mp$ are split accordingly, e.g., $i^{\pm i} = e^{\mp\pi/2}$ [98, (4.4.12)] is split into $i^{+i} = e^{-\pi/2}$ and $i^{-i} = e^{+\pi/2}$. The analyzer utilizes the additional information attached to each line, i.e., the URL in the DLMF, the used and defined symbols, and the constraints. If a used symbol is defined elsewhere in the DLMF, it performs substitutions. For example, the multi-equation [98, (9.6.2)] is split into six test cases, and every $\zeta$ is replaced by $\frac{2}{3} z^{3/2}$ as defined in [98, (9.6.1)]. The substitution is performed on the parse tree of expressions [10]. A definition is only considered as such if the defining symbol is identical to the equation's left-hand side. That means $z = \left(\frac{3}{2}\zeta\right)^{2/3}$ [98, (9.6.10)] is not considered a definition of $\zeta$. Further, semantic macros are never substituted by their definitions. Translations for semantic macros are exclusively defined by the authors.
For example, the equation [98, (11.5.2)] contains the Struve function $\mathbf{K}_{\nu}(z)$. Since Mathematica does not contain this function, we defined an alternative translation via its definition $\mathbf{H}_{\nu}(z) - Y_{\nu}(z)$ in [98, (11.2.5)], with the Struve function $\mathbf{H}_{\nu}(z)$ and the Bessel function of the second kind $Y_{\nu}(z)$, because both of these functions are supported by Mathematica. The second entry in Table E.2 in Appendix E, available in the electronic supplementary material, shows the translation for this test case.
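The splitting into adjacent test cases and the expansion of $\pm$/$\mp$ can be sketched as follows. This is illustrative only: the real case analyzer works on parse trees and handles all relation symbols, not just a plain textual split on "=":

```python
def split_cases(line):
    """Split a multi-relation equation into adjacent pairs and expand
    plus-minus signs: f = g = h yields f = g and g = h (but not f = h),
    and every case containing a plus-minus sign is duplicated once for
    each of the two sign choices."""
    parts = [p.strip() for p in line.split("=")]
    pairs = [f"{lhs} = {rhs}" for lhs, rhs in zip(parts, parts[1:])]
    cases = []
    for case in pairs:
        if "±" in case or "∓" in case:
            cases.append(case.replace("±", "+").replace("∓", "-"))
            cases.append(case.replace("±", "-").replace("∓", "+"))
        else:
            cases.append(case)
    return cases
```

For instance, `split_cases("f(z) = g(z) = h(z)")` produces the two adjacent cases, and `split_cases("i^(±i) = e^(∓π/2)")` produces the two sign-resolved cases from the example above.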

Next, the analyzer recursively checks for additional constraints defined by the used symbols. The mentioned Struve $\mathbf{K}_{\nu}(z)$ test case [98, (11.5.2)] contains the Gamma function. Since the definition of the Gamma function [98, (5.2.1)] has the constraint $\Re z > 0$, the numeric evaluation must respect this constraint too. For this purpose, the case analyzer first tries to link the variables in constraints to the arguments of the functions. For example, the constraint $\Re z > 0$ sets a constraint for the first argument $z$ of the Gamma function. Next, we check all arguments in the actual test case at the same position. The test case contains $\Gamma(\nu + 1/2)$. In turn, the variable $z$ in the constraint of the definition of the Gamma function $\Re z > 0$ is replaced by the actual argument used in the test case. This adds the constraint $\Re(\nu + 1/2) > 0$ to the test case. This process is performed recursively. If a constraint does not contain any variable that is used in the final test case, the constraint is dropped.
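The slot-linking step can be sketched as a plain textual substitution. This is a hypothetical simplification: the real analyzer performs the replacement on the parse tree and recurses over all used symbols:

```python
def propagate_constraint(constraint, definition_args, case_args):
    """Rewrite a constraint attached to a function definition so that it
    applies to a concrete test case: each variable of the definition is
    linked to its argument slot and replaced by the argument actually
    used in the test case at the same position."""
    for def_var, case_arg in zip(definition_args, case_args):
        constraint = constraint.replace(def_var, f"({case_arg})")
    return constraint
```

With the Gamma function example, `propagate_constraint("Re z > 0", ["z"], ["nu + 1/2"])` yields `"Re (nu + 1/2) > 0"`.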

In total, the case analyzer identifies four additional constraints for the test case [98, (11.5.2)]<sup>8</sup>. Note that the constraints may contain variables that do not appear in the actual test case, such as $\nu + k + 1 > 0$. Such constraints do not have any effect on the evaluation because a constraint that cannot be computed to true or false is ignored. Unfortunately, this recursive loading of additional constraints may generate impossible conditions in certain cases, such as $|\Gamma(iy)|$ [98, (5.4.3)]. There are no valid real values of $y$ such that $\Re(iy) > 0$. In turn, every test value would be filtered out, and the numeric evaluation would not verify the equation. However, such cases are the minority, and we were able to increase the number of correct evaluations with this feature.

To avoid a large portion of incorrect calculations, the analyzer filters the dataset before translating the test cases. We apply two filter rules in the case analyzer. First, we filter out expressions that do not contain any semantic macros. Due to the limitations of LACAST, these expressions most likely result in wrong translations. Further, it filters out several meaningless expressions

<sup>8</sup> See Table E.2 in Appendix E available in the electronic supplementary material for the applied constraints (including the directly attached constraint $\Re z > 0$ and the manually defined global constraints from Figure 5.3).

that are not verifiable, such as $z = x$ in [98, (4.2.4)]. The result dataset flags these cases with '*Skipped - no semantic math*'. Note that the result dataset still contains the translations for these cases to provide a complete picture of the DLMF. Second, we filter out expressions that contain ellipses<sup>9</sup> (e.g., \cdots), approximations, and asymptotics (e.g., $\mathcal{O}(z^2)$), since those expressions cannot be evaluated with the proposed approach. Further, a definition is skipped if it is not a definition of a semantic macro, such as [98, (2.3.13)], because definitions without an appropriate counterpart in the CAS are meaningless to evaluate. Definitions of semantic macros, on the other hand, are of special interest and remain in the test set since they allow us to test whether a function in the CAS obeys the actual mathematical definition in the DLMF. If the case analyzer (see Figure 5.2) is unable to detect a relation, i.e., to split an expression on $<$, $\leq$, $\geq$, $>$, $=$, or $\neq$, the line in the dataset is also skipped because the evaluation approach relies on relations to test. After splitting multi-equations (e.g., $\pm$, $\mp$, $a = b = c$) and filtering out all non-semantic expressions, non-semantic macro definitions, ellipses, approximations, and asymptotics, we end up with 6,623 test cases in total from the entire DLMF.

After generating the test cases with all constraints, we translate the expressions to the CAS representation. Every successfully translated test case is then verified symbolically, i.e., the CAS tries to simplify the difference of the two sides of an equation to zero. Non-equation relations are simplified to Booleans. Non-simplified expressions are verified numerically for manually defined test values, i.e., we calculate actual numeric values for both sides of an equation and check their equivalence.

### **5.1.3.1 Symbolic Evaluation**

The symbolic evaluation was performed for Maple as described in the following (taken from [2]). Originally, we used the standalone Maple simplify function directly to symbolically simplify translated formulae. See [26, 28, 148, 190] for examples of where Maple and other CAS simplification procedures have been used elsewhere in the literature. Symbolic simplification is performed either on the difference or on the division of the left-hand sides and the right-hand sides of extracted formulae. Thus, the expected outcome should be either 0 or 1, respectively. Note that other outcomes, such as other numerical outcomes, are particularly interesting, since these may be an indication of errors in the formulae.
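Concretely, the verification step amounts to generating one CAS call per test case. A minimal sketch of the command generation, assuming Maple's simplify and Mathematica's FullSimplify as named in this section (only the difference variant is shown; the division variant would expect 1 instead of 0):

```python
def symbolic_test(lhs, rhs, cas="Maple"):
    """Build the CAS command for symbolic verification of an equation:
    simplify the difference of both sides and expect 0 for a verified
    equation."""
    if cas == "Maple":
        return f"simplify(({lhs}) - ({rhs}));"
    return f"FullSimplify[({lhs}) - ({rhs})]"
```

For example, `symbolic_test("sin(x)^2 + cos(x)^2", "1")` produces the Maple call whose expected result is 0.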

In Maple, symbolic simplifications are made using internally stored relations to other functions. If a simplification is available, then in practice it often has to be performed over multiple defined relevant relations. Often, this process fails, and Maple is unable to simplify the given expression. We have adopted some techniques to assist Maple in this process. For example, forcing an expression to be converted into another specific representation in a pre-processing step could potentially improve the odds that Maple is able to recognize a possible simplification. By trial and error, we discovered (and implemented) several pre-processing steps that significantly improve the simplification process.


<sup>9</sup> Note that we filter out ellipses (e.g., \cdots) but not single dots (e.g., \cdot).

Figure 5.3: The ten numeric test values in the complex plane for general variables. The dashed line represents the unit circle $|z| = 1$. On the right, we show the set of values for special variable values and general global constraints; there, $i$ refers to a generic variable and not to the imaginary unit.

In comparison to the original approach described in [2], we now use the newer version Maple 2020. Another feature we added to LACAST is the support for packages in Maple. Some functions are only available in modules (packages) that must be preloaded, such as QPochhammer in the package QDifferenceEquations<sup>10</sup>. The general simplify method in Maple does not cover $q$-hypergeometric functions. Hence, whenever LACAST loads functions from the $q$-hypergeometric package, the better performing QSimplify method is used. With the WED and the new support for Mathematica in LACAST, we perform the symbolic and numeric tests for Mathematica as well. The symbolic evaluation in Mathematica relies on the full simplification<sup>11</sup>. For Maple and Mathematica, we defined the global assumptions $x, y \in \mathbb{R}$ and $k, n, m \in \mathbb{N}$. Constraints of test cases are added to these assumptions to support simplification. Adding more global assumptions for symbolic computation generally harms the performance since CAS internally use assumptions for simplifications. It turned out that by adding more custom assumptions, the number of successfully simplified expressions decreases.

### **5.1.3.2 Numerical Evaluation**

Defining an accurate test set of values to analyze an equivalence can be an arbitrarily complex process. It would make sense to test every expression on specific values according to the functions it contains. However, this laborious process is not suitable for evaluating entire DML and CAS. It makes more sense to develop a general set of test values that (i) generally covers interesting domains and (ii) avoids singularities, branch cuts, and similar problematic regions. Considering these two attributes, we came up with the ten test points illustrated in Figure 5.3. They comprise four complex values on the unit circle and six points on the real axis. The test values cover the general area of interest (complex values in all four quadrants, negative and positive real values) and avoid the typical singularities at $\{0, \pm 1, \pm i\}$. In addition, several variables are tied to specific values for entire sections. Hence, we applied additional global constraints to the test cases.
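A concrete set with these properties can be written down directly. The values below are illustrative assumptions matching the description of Figure 5.3 (four points on the unit circle, one per quadrant, and six real points), not necessarily the exact values used:

```python
import cmath

# Four complex values on the unit circle, one per quadrant, plus six
# real values; all stay away from the typical singularities 0, ±1, ±i.
UNIT_CIRCLE = [cmath.exp(1j * cmath.pi * k / 4) for k in (1, 3, 5, 7)]
REAL_AXIS = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
TEST_VALUES = UNIT_CIRCLE + REAL_AXIS

# sanity check: no test point lies near a typical singularity
SINGULARITIES = [0, 1, -1, 1j, -1j]
assert all(abs(v - s) > 0.1 for v in TEST_VALUES for s in SINGULARITIES)
```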

<sup>10</sup>https://jp.maplesoft.com/support/help/Maple/view.aspx?path=QDifferenceEquations/QPochhammer [accessed 2021-05-01]

<sup>11</sup>https://reference.wolfram.com/language/ref/FullSimplify.html [accessed 2021-05-01]

The numeric evaluation engine heavily relies on the performance of extracting free variables from an expression. Maple does not provide a function to extract free variables from an expression. Hence, we first implemented a custom method. Variables are extracted by identifying all names [36]<sup>12</sup> in an expression. This also extracts constants, which first need to be removed from the list. Unfortunately, inbuilt functions in CAS, if available, and our custom implementation for Maple are not very reliable. Mathematica has the undocumented function Reduce`FreeVariables for this purpose. However, both systems, the custom solution in Maple and the inbuilt Mathematica function, have problems distinguishing the free variables of entire expressions from the bound variables in MEOMs, e.g., integration and continuous variables. Mathematica sometimes does not extract a variable but returns the unevaluated input instead. We regularly faced this issue for integrals. However, we discovered one example without integrals. For EulerE[n,0] from [98, (24.4.26)], we expected to extract $\{n\}$ as the set of free variables but instead received a set containing the unevaluated expression itself, {EulerE[n,0]}<sup>13</sup>. Since the extended version of LACAST handles operators, including bound variables of MEOMs, we dropped the use of internal methods in CAS and extended LACAST to extract identifiers from an expression. During a translation process, LACAST tags every single identifier as a variable, as long as it is not an element of a MEOM. This simple approach proves to be very efficient since it is implemented alongside the translation process itself and is already more powerful than the existing inbuilt CAS solutions. We defined subscripts of identifiers as part of the identifier, e.g., $z_1$ and $z_2$ are extracted as variables from $z_1 + z_2$ rather than $z$.
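The tagging approach reduces to a simple membership test once the translator knows the bound variables of the MEOMs it has processed. A toy sketch over a token list, restricted to single-letter identifiers with optional numeric subscripts (matching the $z_1 + z_2$ example); the token representation is an assumption:

```python
import re

# single letter, optionally followed by a numeric subscript, e.g. z or z_1
IDENTIFIER = re.compile(r"^[a-zA-Z](_\d+)?$")

def free_variables(tokens, bound):
    """Tag every identifier token as a free variable unless it is a
    bound variable of an enclosing MEOM (sum, product, integral, limit).
    Subscripted identifiers such as z_1 count as single variables."""
    return {t for t in tokens if IDENTIFIER.match(t) and t not in bound}
```

For the tokens of $\sum_{k=1}^{n} (z_1 + z_2)^k$ with the bound variable set {"k"}, this yields {"n", "z_1", "z_2"}.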

The general pipeline for a numeric evaluation works as follows. First, we replace all substitutions and extract the variables from the left- and right-hand sides of the test expression via LACAST. For the previously mentioned example of the Struve function [98, (11.5.2)], LACAST identifies two variables in the expression, $\nu$ and $z$. According to the values in Figure 5.3, $\nu$ and $z$ are set to the ten general values. A numeric test contains every combination of test values for all variables. Hence, we generate 100 test calculations for [98, (11.5.2)]. Afterward, we filter out the test values that violate the attached constraints. In the case of the Struve function, we end up with 25 test cases (see also Table E.2 in Appendix E available in the electronic supplementary material).

In addition, we apply a limit of 300 calculations for each test case and abort a computation after 30 seconds due to computational limitations. If a test case generates more than 300 test values, only the first 300 are used. Finally, we calculate the result for every remaining test value, i.e., we replace every variable by its value and calculate the result. The replacement is done by Mathematica's ReplaceAll method because the more appropriate method With, for unknown reasons, does not always replace all variables by their values. We wrap test expressions in Normal for numeric evaluations to avoid conditional expressions, which may cause incorrect calculations (see Section 5.1.4.1 for a more detailed discussion of conditional outputs). After replacing the variables by their values, we trigger the numeric computation. If the absolute value of the result is below the defined threshold of 0.001, or true in the case of inequalities, the test calculation is considered successful. A numeric test case is considered successful if and only if every test calculation was successful. If a numeric test case fails, we store the information on which values it failed and how many of them were successful.
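Putting the pieces together, the numeric pipeline for one test case can be sketched as below. This is illustrative only: `diff` stands for the translated lhs − rhs evaluated at a value assignment, and the constraint functions are assumed to return booleans (the 30-second timeout is omitted):

```python
from itertools import islice, product

THRESHOLD = 0.001       # |lhs - rhs| below this counts as equal
MAX_CALCULATIONS = 300  # cap on the number of test calculations

def numeric_test(diff, variables, values, constraints=()):
    """Verify one test case numerically: build every combination of test
    values for the variables, drop combinations violating a constraint,
    cap the number of calculations, and succeed only if every remaining
    calculation stays below the threshold."""
    combos = (dict(zip(variables, vs))
              for vs in product(values, repeat=len(variables)))
    valid = (c for c in combos if all(con(c) for con in constraints))
    results = [abs(diff(c)) for c in islice(valid, MAX_CALCULATIONS)]
    return bool(results) and all(r < THRESHOLD for r in results)
```

A case like $(z+1)^2 = z^2 + 2z + 1$ passes for any test values, while $z^2 = z$ fails for $z = 2$; if the constraints filter out every test value (the impossible-condition situation described above), no calculation remains and the test cannot verify the equation.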

<sup>12</sup>A *name* in Maple is a sequence of one or more characters that uniquely identifies a command, file, variable, or other entity.

<sup>13</sup>The bug was reported to and confirmed by Wolfram Research for Version 12.0.

### **5.1.4 Results**

The translations to Maple and Mathematica, the symbolic results, the numeric computations, and an overview PDF of the bugs reported to Mathematica are available online<sup>14</sup>. In the following, we mainly focus on Mathematica because of page limitations and because Maple has been investigated more closely in [2]. The results for Maple are also available online. Compared to the baseline (≈ 31%), our improvements doubled the number of translations (≈ 62%) for Maple and reach ≈ 71% for Mathematica. The majority of expressions that cannot be translated contain macros that have no adequate translation pattern for the CAS, such as the macros for the Weierstrass lattice roots [98, §23.3(i)] and the multivariate hypergeometric function [98, (19.16.9)]. Other errors (6% for Maple and Mathematica) occur for several reasons. For example, out of the 418 errors in translations to Mathematica, 130 were caused because the MEOM of an operator could not be extracted, 86 contained prime notations that do not refer to differentiations, 92 failed because of the underlying LATEX parser [402], and in 46 cases, the arguments of a DLMF macro could not be extracted.

Out of 4,713 translated expressions, 1,235 (26.2%) were successfully simplified by Mathematica (1,084 of 4,114, or 26.3%, in Maple). For Mathematica, we also count results that are equal to 0 under certain conditions as successful (called ConditionalExpression). We identified 65 of these conditional results: 15 of the conditions are equal to constraints that were provided in the surrounding text but not in the info box of the DLMF equation; 30 were produced due to branch cut issues (see Section 5.1.4.1); and 20 were the same as those attached in the DLMF but reformulated, e.g., $z \in \mathbb{C} \setminus (1, \infty)$ from [98, (25.12.2)] was reformulated to $\Im z \neq 0 \lor z < 1$. The remaining translated but not symbolically verified expressions were numerically evaluated for the test values in Figure 5.3. Of these 3,474 cases, 784 (22.6%) were successfully verified numerically by Mathematica (698 of 2,618, or 26.7%, by Maple<sup>15</sup>). For 1,784 cases, the numeric evaluation failed. In the evaluation process, 655 computations timed out and 180 failed due to errors in Mathematica. Of the 1,784 failed cases, 691 failed partially, i.e., there was at least one successful calculation among the tested values. For 1,091 cases, all test values failed. Appendix E, available in the electronic supplementary material, provides Table E.2 with the results for three sample test cases. The first case is a false positive evaluation because of a wrong translation. The second case is valid, but the numeric evaluation failed due to a bug in Mathematica (see the next subsection). The last example is valid and was verified numerically but was too complex for symbolic verification.

### **5.1.4.1 Error Analysis**

The numeric tests' performance strongly depends on correctly attached and utilized information. The example [98, (1.4.8)] from the DLMF

$$\frac{\mathrm{d}^2 f}{\mathrm{d}x^2} = \frac{\mathrm{d}}{\mathrm{d}x} \left(\frac{\mathrm{d}f}{\mathrm{d}x}\right),\tag{5.2}$$

illustrates the difficulty of the task in a relatively easy case<sup>16</sup>. Here, the argument of $f$ was not explicitly given, such as in $f(x)$. Hence, LACAST translated $f$ as a variable. Unfortunately,

<sup>14</sup>https://lacast.wmflabs.org/ [accessed 2021-10-01]

<sup>15</sup>Due to computational issues, 120 cases had to be skipped manually. 292 cases resulted in an error during symbolic verification and were therefore skipped for numeric evaluations as well. Considering these skipped cases as failures decreases the numerically verified cases to 23% in Maple.

<sup>16</sup>This is the first example in Table E.2.

this resulted in a false verification both symbolically and numerically. This type of error mostly appears in the first three chapters of the DLMF because they use generic functions frequently. We hoped to skip such cases by filtering expressions without semantic macros. Unfortunately, this derivative notation uses the semantic macro deriv. In the future, we will filter expressions that contain semantic macros that are not linked to a special function or orthogonal polynomial.

As an attempt to investigate the reliability of the numeric test pipeline, we can run numeric evaluations on symbolically verified test cases. Since Mathematica has already approved a translation symbolically, the numeric test should be successful if the pipeline is reliable. Of the 1,235 symbolically successful tests, only 94 (7.6%) failed numerically. None of the failed test cases failed entirely, i.e., for every test case, at least one test value was verified. Manually investigating the failed cases revealed 74 cases that failed due to an Indeterminate response from Mathematica and 5 that returned infinity, which clearly indicates that the tested numeric values were invalid, e.g., due to testing on singularities. Of the remaining 15 cases, two were identical: [98, (15.9.2)] and [98, (18.5.9)]. This reduces the remaining failed cases to 14. For 12 of these, we evaluated invalid values because the constraints for the values were given in the surrounding text but not in the info boxes. The remaining 2 cases revealed a bug in Mathematica regarding conditional outputs (see below). The results indicate that the numeric test pipeline is reliable, at least for relatively simple cases that were previously verified symbolically. The main reasons for the high number of failed numeric cases in the entire DLMF (1,784) are missing constraints in the i-boxes and branch cut issues (see Section 5.1.4.1), i.e., we evaluated expressions on invalid values.

**Bug reports** Mathematica has trouble with certain integrals, which, by default, generate conditional outputs if applicable. With the method Normal, we can suppress conditional outputs. However, it only hides the condition rather than evaluating the expression to a non-conditional output. For example, integral expressions in [98, (10.9.1)] are automatically evaluated to the Bessel function $J_0(|z|)$ under the condition<sup>17</sup> $z \in \mathbb{R}$ rather than to $J_0(z)$ for all $z \in \mathbb{C}$. Setting the GenerateConditions<sup>18</sup> option to None does not change the output. Normal only hides $z \in \mathbb{R}$ but still returns $J_0(|z|)$. To fix this issue, for example in (10.9.1) and (10.9.4), we are forced to set GenerateConditions to false.

Setting GenerateConditions to false, on the other hand, reveals severe errors in several other cases. Consider $\int_z^\infty t^{-1} e^{-t}\,\mathrm{d}t$ [98, (8.4.4)], which gets evaluated to $\Gamma(0, z)$ under the condition $z > 0 \land z \neq 0$. With GenerateConditions set to false, the integral incorrectly evaluates to $\Gamma(0, z) + \ln(z)$. This happened with the 2 cases mentioned above. With the same setting, the difference of the left- and right-hand sides of [98, (10.43.8)] is evaluated to 0.398942 for $x, \nu = 1.5$. If we evaluate the same expression on $x, \nu = \frac{3}{2}$, the result is Indeterminate due to infinity. For this issue, one may use NIntegrate rather than Integrate to compute the integral. However, evaluating via NIntegrate decreases the number of successful numeric evaluations in general. We revealed errors with conditional outputs in (8.4.4), (10.22.39), (10.43.8-10), and (11.5.2) (in [98]). In addition, we identified one critical error in Mathematica. For [98, (18.17.47)], WED (Mathematica's kernel) ran into a *segmentation fault (core dumped)* for $n > 1$. The kernel of the full version of Mathematica gracefully died without returning an output<sup>19</sup>.

<sup>17</sup>$J_0(x)$ with $x \in \mathbb{R}$ is even. Hence, $J_0(|z|)$ is correct under the given condition.
<sup>18</sup>https://reference.wolfram.com/language/ref/GenerateConditions.html [accessed 2021-05-01]

<sup>19</sup>All errors were reported to and confirmed by Wolfram Research.

Besides Mathematica, we also identified several issues in the DLMF. None of the newly identified issues were critical, such as the sign error reported in the previous project [2]; they generally refer to missing or wrongly attached semantic information. With the generated results, we can effectively fix these errors and further semantically enhance the DLMF. For example, some definitions are not marked as such, e.g., $Q(z) = \int_0^\infty e^{-zt} q(t)\,\mathrm{d}t$ [98, (2.4.2)]. In [98, (10.24.4)], $\nu$ must be a real value but was linked to a *complex parameter*, and $x$ should be positive real. An entire group of cases [98, (10.19.10-11)] also revealed the incorrect use of semantic macros. In these formulae, $P_k(a)$ and $Q_k(a)$ are defined but had been incorrectly marked up as Legendre functions going all the way back to DLMF Version 1.0.0 (May 7, 2010). In some cases, equations are mistakenly marked as definitions, e.g., [98, (9.10.10)] and [98, (9.13.1)] are annotated as local definitions of $n$. We also identified an error in LACAST, which incorrectly translated the exponential integrals $E_1(z)$, $\mathrm{Ei}(x)$, and $\mathrm{Ein}(z)$ (defined in [98, §6.2(i)]). A more explanatory overview of discovered, reported, and fixed issues in the DLMF, Mathematica, and Maple is provided in Appendix D, available in the electronic supplementary material.

**Branch cut issues** Problems that we regularly faced during the evaluation are issues related to multi-valued functions. Multi-valued functions map values from a domain to multiple values in a codomain and frequently appear in the complex analysis of elementary and special functions. Prominent examples are the inverse trigonometric functions, the complex logarithm, or the square root. A proper mathematical description of multi-valued functions requires the complex analysis of Riemann surfaces. Riemann surfaces are one-dimensional complex manifolds associated with a multi-valued function. One usually extends the complex domain to a many-layered covering space. The correct properties of multi-valued functions on the complex plane may no longer be valid for their counterpart functions in CAS, e.g., $(1/z)^w$ and $1/(z^w)$ for $z, w \in \mathbb{C}$ and $z \neq 0$. For example, consider $z, w \in \mathbb{C}$ such that $z \neq 0$. Then, mathematically, $(1/z)^w$ always equals $1/(z^w)$ (when defined) for all points on the Riemann surface with fixed $w$. However, this should certainly not be assumed to be true in CAS unless very specific assumptions are adopted (e.g., $w \in \mathbb{Z}$, $z > 0$). For all modern CAS<sup>20</sup>, this equation is not true. Try, for instance, $w = 1/2$. Then $(1/z)^{1/2} - 1/z^{1/2} \neq 0$ on CAS, nor for $w$ being any other rational non-integer number.
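The effect is easy to reproduce with any implementation of principal-branch complex arithmetic, e.g., Python's cmath, which adopts the same principal-branch convention as the CAS discussed here. The identity below is a closely related instance of the same phenomenon, the product rule for the principal square root (chosen instead of the $(1/z)^w$ example because it fails robustly in floating-point arithmetic):

```python
import cmath

# On the principal branch, sqrt(z*w) = sqrt(z)*sqrt(w) fails as soon as
# the arguments wrap around the branch cut on the negative real axis:
z = w = complex(-1, 0)
lhs = cmath.sqrt(z * w)              # sqrt(1)  ->  1
rhs = cmath.sqrt(z) * cmath.sqrt(w)  # i * i    -> -1
# equal as multi-valued functions, but the principal values differ by 2
```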

In order to compute multi-valued functions, CAS choose branch cuts for these functions so that they can evaluate them on their principal branches. Branch cuts may be positioned differently among CAS [84], e.g., $\operatorname{arccot}(-\tfrac{1}{2}) \approx 2.03$ in Maple but $\approx -1.11$ in Mathematica. This is certainly not an error and is usually well documented for specific CAS [108, 171]. However, there is no central database that summarizes branch cuts in different CAS or DML. The DLMF, too, explains and defines its branch cuts carefully but does not carry the information within the info boxes of expressions. Due to this complexity, it is rather easy to lose track of branch cut positioning and evaluate expressions at incorrect values. For example, consider the equation [98, (12.7.10)]. A path $z(\varphi) = e^{i\varphi}$ with $\varphi \in [0, 2\pi]$ would pass three different branch cuts. An accurate evaluation of the values of $z(\varphi)$ in CAS requires calculations on the three branches using analytic continuation. LACAST and our evaluation frequently fall into the same trap by evaluating values that are no longer on the principal branch used by CAS. To solve this issue, we need to utilize branch cuts not only for every function but also for every equation in the DLMF [10]. The positions of branch cuts are exclusively provided in the text
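For illustration, the two documented conventions for real arguments can be sketched as follows (the function names are ours, not the CAS's API):

```python
from math import atan, pi

def arccot_maple_style(x):
    # Range (0, pi), continuous on the reals: Maple's documented convention.
    return pi / 2 - atan(x)

def arccot_mathematica_style(x):
    # Odd function with range in (-pi/2, pi/2]: Mathematica's convention.
    return atan(1 / x)

print(round(arccot_maple_style(-0.5), 2))        # 2.03
print(round(arccot_mathematica_style(-0.5), 2))  # -1.11

# For positive arguments the two conventions coincide:
assert abs(arccot_maple_style(2.0) - arccot_mathematica_style(2.0)) < 1e-12
```

Both definitions are valid principal branches; they merely place the discontinuity differently, which is exactly why translations between systems must track the convention.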

<sup>20</sup>The authors are not aware of any example of a CAS which treats multi-valued functions without adopting principal branches.

but not in the i-boxes. Adding the information to each equation in the DLMF would be a laborious process because a branch cut position may change according to the used values (see the example [98, (12.7.10)] from above). Our result data, however, would provide beneficial information to update, extend, and maintain the DLMF, e.g., by adding the positions of the branch cuts for every function. An extended discussion of branch cut issues is available in Appendix A available in the electronic supplementary material.

### **5.1.5 Conclude Quantitative Evaluations on the DLMF**

We have presented a novel approach to verify the theoretical digital mathematical library DLMF with the power of the two major general-purpose computer algebra systems Maple and Mathematica. With LACAST, we transformed the semantically enhanced LATEX expressions from the DLMF to each CAS. Afterward, we symbolically and numerically evaluated the DLMF expressions in each CAS. Our results are auspicious and provide useful information to maintain and extend the DLMF efficiently. We further identified several errors in Mathematica, Maple [2], the DLMF, and the transformation tool LACAST, proving the profit of the presented verification approach. Further, we provide open access to all results, including translations and evaluations<sup>21</sup>.

The presented results are a promising step towards an answer to our initial research question. By translating an equation from a DML to a CAS, automatic verification of that equation in the CAS allows us to detect issues in either the DML source or the CAS implementation. Each analyzed failed verification successively improves the DML or the CAS. Further, analyzing a large number of equations from the DML may eventually be used to verify a CAS. In addition, the approach can be extended to cover other DML and CAS by exploiting different translation approaches, e.g., via MathML [18] or OpenMath [152].

Nonetheless, the analysis of the results, especially for an entire DML, is cumbersome. Minor missing semantic information, e.g., a missing constraint or unconsidered branch cut positions, leads to a relatively large number of false positives, i.e., unverified expressions that are correct in the DML and the CAS. This makes a generalization of the approach challenging because all semantics of an equation must be taken into account for a trustworthy evaluation. Furthermore, evaluating equations on a small number of discrete values will never provide sufficient confidence to verify a formula, which leads to an unpredictable number of true negatives, i.e., erroneous equations that pass all tests.

Overall, we conclude that the approach provides valuable information to complement, improve, and maintain the DLMF, Maple, and Mathematica. A trustworthy verification, on the other hand, might be out of reach.

### **5.1.5.1 Future Work**

The resulting dataset provides valuable information about the differences between CAS and the DLMF. These differences have not been studied extensively in the past and are worthy of analysis. Especially a comprehensive and machine-readable list of branch cut positioning in different systems is a desired goal [84]. Hence, we will continue to work closely with the editors of the DLMF to further improve and expand the information available in the DLMF. Finally, the numeric evaluation approach would benefit from test values that depend on the actual functions involved. For example, the current layout of the test values was designed to avoid

<sup>21</sup>https://lacast.wmflabs.org/ [accessed 2021-10-01]

problematic regions, such as branch cuts. However, for identifying differences between the DLMF and CAS, especially for analyzing the positioning of branch cuts, an automatic evaluation of these particular values would be very beneficial and could be used to collect a comprehensive, inter-system library of branch cuts. Therefore, we will further study the possibility of linking semantic macros with numeric regions of interest.

Finally, we used LACAST to perform translations solely on semantic LATEX expressions. Real-world mathematics, however, is not available in this semantically enriched format. In the previous chapter, we developed and discussed a context-sensitive extension for LACAST. This enables LACAST to translate not only semantic LATEX formulae from the DLMF but, by considering an informative textual context, also general mathematical expressions to multiple CAS. In the following section, we evaluate this new extension of LACAST on Wikipedia articles.

### **5.2 Evaluations on Wikipedia**

In the following, resulting from our motivation outlined in Chapter 4 - improving Wikipedia articles - we use Wikipedia as our test dataset to evaluate our context-sensitive extension of LACAST. More specifically, we considered every English Wikipedia article that references the DLMF via the {{dlmf}} template<sup>22</sup>. This should limit the domain to the OPSF problems that we are currently examining. The English Wikipedia contains 104 such pages, of which only one did not contain any formula (Spheroidal wave function)<sup>23</sup>. For the entire dataset (the remaining 103 Wikipedia pages), we detected 6,337 formulae in total (including potentially erroneous math).

So far, one of our initial three issues from Section 4.2.3 remains unsolved: how can we determine whether a translation was appropriate and complete? We call a translation appropriate if the intended meaning of a presentational expression $e \in L_P$ is the same as that of the translated expression $t(e, X) \in L_C$. However, how can we know the intended semantic meaning of a presentational expression $e \in L_P$? For natural languages, the BLEU score [282] is widely used to judge the quality of a translation. The effectiveness of the BLEU score, however, is questionable when it comes to math translations due to the complexity and high interconnectedness of mathematical formulae. Consider, for example, a translation of the arccotangent function $\operatorname{arccot}(x)$ to $\arctan(1/x)$ in Maple. This translation is correct and even preferred in certain situations to avoid issues with so-called branch cuts (see [13, Section 3.2]). Previously, we developed an approach that relies on automatic verification checks with CAS [2, 11] to verify a translation. This approach is very powerful for large datasets. However, it requires a large and precise amount of semantic data about the involved formulae, including constraints, domains, the position of branch cuts, and other information to reach high accuracy. Hence, we perform this automatic verification on all 103 Wikipedia pages but additionally created a benchmark dataset with 95 entries for qualitative analysis. To avoid issues like those with the BLEU score, we manually evaluated each translation of the 95 test cases.

<sup>22</sup>Templates in Wikitext are placeholders for repetitive information which get resolved by Wikitext parsers. The DLMF-template, for example, adds the external reference for the DLMF to the article.

<sup>23</sup>Retrieved from https://en.wikipedia.org/wiki/Special:WhatLinksHere by searching for *Template:Dlmf* [accessed 2021-01-01]

### **5.2.1 Symbolic and Numeric Testing**

The automatic verification approach makes the assumption that a correct equation in the domain must remain valid in the codomain after a translation. If the equation is incorrect after a translation, we conclude a translation error. As discussed in the previous Section 5.1, we examined two approaches to verify an equation in a CAS. The first approach tries to symbolically simplify the difference of the left- and right-hand sides of an equation to zero. If the simplification returns zero, the equation is symbolically verified by the CAS. Symbolic simplifications in CAS, however, are rather limited and may even fail on simple equations. The second approach overcomes this issue by numerically calculating the difference between the left- and right-hand sides of an equation for specific numeric test values. If the difference is zero (or below a given threshold due to machine accuracy) for every test calculation, the equivalence is numerically verified. Clearly, the numeric evaluation approach cannot prove equivalence. However, it can prove disparity and therefore detect an error due to the translation.
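The numeric part of this verification scheme can be sketched in a few lines. The helper below is an illustration with hand-picked test values, not our actual test-value layout:

```python
import cmath

# Sketch of the numeric evaluation approach: an equation counts as
# verified only if lhs and rhs agree (up to machine accuracy) on every
# test value; a single disagreement proves the translated equation wrong.
def numerically_verified(lhs, rhs, test_values, threshold=1e-7):
    return all(abs(lhs(z) - rhs(z)) <= threshold for z in test_values)

test_values = [0.5 + 0.5j, 1.2 - 0.3j, 2.0 + 0.1j]

# A correct identity passes ...
assert numerically_verified(lambda z: cmath.sin(2 * z),
                            lambda z: 2 * cmath.sin(z) * cmath.cos(z),
                            test_values)
# ... while a faulty "translation" is exposed by the same test values.
assert not numerically_verified(lambda z: cmath.sin(2 * z),
                                lambda z: 2 * cmath.sin(z),
                                test_values)
```

Note that a passing result only means "not disproved on these samples": the symbolic step, delegated to the CAS's own simplifier, remains necessary for stronger evidence.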

In the previous Section 5.1, we saw that the translations by LACAST [13] were so reliable that the combination of symbolic and numeric evaluations was able to detect errors in the domain library (i.e., the DLMF) and the codomain systems (i.e., the CAS Maple and Mathematica) [2, 11]. Unfortunately, the number of false positives, i.e., correct equations that were verified neither symbolically nor numerically, is relatively high. The main reason is unconsidered semantic information, such as constraints on specific variables or the position of branch cuts. Unconsidered semantic information causes the system to test equivalence under invalid conditions, such as invalid values, and therefore yields inequalities between the left- and right-hand sides of an equation even though the source equation and the translation were correct. Nonetheless, the symbolic and numeric evaluation approach proves to be very useful for our translation system as well. It allows us to quantitatively evaluate a large number of expressions in Wikipedia. In addition, it enables continuous integration testing for mathematics in Wikipedia article revisions. For example, an equation previously verified by the system that fails after a revision could indicate a poisoned revision of the article. This automatic plausibility check might be a jump start for the ORES system to better maintain the quality of mathematical documents [359]. For changes in math equations, ORES could trigger a plausibility check through our translation and verification pipeline and adjust its good-faith and damaging scores for the edit accordingly.

### **5.2.2 Benchmark Testing**

To compensate for the relatively low number of verifiable equations in Wikipedia with the symbolic and numeric evaluation approach, we crafted a benchmark test dataset to qualitatively evaluate the translations. This benchmark includes a single equation (the formula must contain a top-level equality symbol<sup>24</sup>, no \text, and no \color macros) randomly picked from each Wikipedia article in our dataset. For eight articles, no such equation was detected. Hence, the benchmark contains 95 test expressions. For each formula, we marked the extracted descriptive terms as irrelevant (0), relevant (1), or highly relevant (2), and manually translated the expressions to semantic LATEX and to Maple and Mathematica. If the formula contains a function for which no appropriate semantic macro exists, the semantic LATEX equals the generic (original) LATEX of this function. In 18 cases, even the human annotator was unable to appropriately

<sup>24</sup>This excludes equality symbols of deeper levels in the parse tree, e.g., the equality symbols in sums are not considered as such.

Table 5.3: The symbolic and numeric evaluations on all 6,337 expressions from the dataset with the number of translated expressions (**T**), the number of started test evaluations (**Started**), the success rates (**Success**), and the success rates on the DLMF dataset for comparison (**DLMF**). The DLMF scores refer to the results presented in the previous Section 5.1.




translate the expressions to the CAS, which underlines the difficulty of the task. The main reason for a manual translation failure was missing information (the necessary information for an appropriate translation was not given in the article) or elements for which an appropriate translation was not possible, such as contour integrals, approximations, or indefinite lists of arguments with dots (e.g., $a_1, \dots, a_n$). Note that the domain of orthogonal polynomials and special functions is a well-supported domain for many general-purpose CAS, like Maple and Mathematica. Hence, in other domains, such as group, number, or tensor field theory, we can expect a significant drop in human-translatable expressions<sup>25</sup>. Since Mathematica is able to import LATEX expressions, we use this import function as a baseline for our translations to Mathematica. We provide full access to the benchmark via our demo website and added an overview to Appendix F.4 available in the electronic supplementary material.

### **5.2.3 Results**

First, we evaluated the 6,337 detected formulae with our automatic evaluation via Maple and Mathematica. Table 5.3 shows an overview of this evaluation. With our translation pipeline, we were able to translate 72.6% of the mathematical expressions into Maple and 73.8% into Mathematica syntax. Of these translations, around 40% were symbolically and numerically evaluated (the rest was filtered out due to missing equation symbols, illegal characters, etc.). We were able to symbolically verify 11% (Maple) and 15% (Mathematica), and numerically verify 18% (Maple) and 24% (Mathematica). In comparison, the same tests on the manually annotated semantic dataset of DLMF equations [403] reached a success rate of 26% for symbolic and 43% for numeric evaluations [11] (see the previous Section 5.1). Since the DLMF is a manually annotated semantic dataset that provides exclusive access to constraints, substitutions, and other relevant information, these are very promising results for our context-sensitive pipeline. To test a theoretical continuous integration pipeline for the ORES system in Wikipedia articles, we also analyzed edits in math equations that have been reverted again. The article on Bessel functions contains

<sup>25</sup>Note that there are numerous specialized CAS that would cover the mentioned domains too, such as GAP [177], PARI/GP [283], or Cadabra [290].

such an edit on the equation

$$J_n(x) = \frac{1}{\pi} \int_0^\pi \cos(n\tau - x\sin\tau) \,d\tau. \tag{5.3}$$

Here, the edit<sup>26</sup> changed $J_n(x)$ to $J_{ZWE}(x)$. Our pipeline was able to symbolically and numerically verify the original expression but failed on the revision. The ORES system could profit from this result and adjust the score according to the automatic verification via CAS.
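Such a plausibility check can be reproduced with elementary numerics. The following sketch (standard library only, helper names are ours) verifies the integral representation in equation (5.3) for sample values, and shows that reinterpreting the vandalized symbol, e.g., as a different order, fails the same check:

```python
from math import cos, sin, pi, factorial

def bessel_j(n, x, terms=30):
    # Taylor series of J_n(x); adequate for modest n and x.
    return sum((-1) ** k / (factorial(k) * factorial(n + k)) * (x / 2) ** (2 * k + n)
               for k in range(terms))

def integral_form(n, x, steps=10_000):
    # (1/pi) * integral_0^pi cos(n*tau - x*sin(tau)) dtau via the midpoint rule.
    h = pi / steps
    return sum(cos(n * (i + 0.5) * h - x * sin((i + 0.5) * h))
               for i in range(steps)) * h / pi

# The original equation (5.3) verifies numerically ...
assert abs(bessel_j(2, 1.5) - integral_form(2, 1.5)) < 1e-6
# ... while a different order on the left-hand side is rejected.
assert abs(bessel_j(3, 1.5) - integral_form(2, 1.5)) > 1e-3
```

A continuous-integration hook would run exactly this kind of check on every revision that touches the equation and flag the ones that stop verifying.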

### **5.2.3.1 Descriptive Term Extractions**

Previously, we presumed that our update of the description retrieval approach to MOI would yield better results. In order to check the ranking of retrieved facts, we evaluate the descriptive term extractions and compare the results with our previously reported F1 scores in [330]. We analyze the performance for different numbers of retrieved descriptions and different depths. Here, the depth refers to the maximum depth of in-going dependencies in the dependency graph used to retrieve relevant descriptions. A depth value of zero does not retrieve additional terms from the in-going dependencies but only the noun phrases that are directly annotated to the formula itself. The results for relevance 1 or higher are given in Table 5.4a and for relevance 2 in Table 5.4b. Since we need to retrieve a high number of relevant facts to achieve a complete translation, we are more interested in retrieving any relevant fact rather than a single but precise description. Hence, the performance for relevance 1 is more appropriate for our task. For a better comparison with our previous pipeline [330], we also analyze the performance only on highly relevant descriptions (relevance 2). As expected, for relevant noun phrases, we outperform the previously reported F1 score (.35). For highly relevant entries only, our updated MOI pipeline achieves similar results with an F1 score of .385.

#### **5.2.3.2 Semantification**

Since we split our translation pipeline into two steps, semantification and mapping, we evaluate the semantification transformations first. To do this, we use our benchmark dataset and perform tree comparisons between our generated transformed tree $t_s(e, X)$ and the semantically enhanced tree using semantic macros. The number of facts we take into account affects the performance. With fewer facts, the transformation might be incomplete, i.e., there are still subtrees in $e$ that should already be in $L_C$. Too many facts increase the risk of false positives that yield wrong transformations. In order to estimate how many facts we need to retrieve to achieve a complete transformation, we evaluated the comparison for different depths D and limited the number of facts for the same MOI, i.e., we only consider the top-ranked facts $f$ for an MOI according to $s_{\mathrm{MLP}}(f)$. In addition, we limit the number of retrieved rules $r_f$ per MC. We observed that an equal limit of retrieved MC per MOI and $r_f$ per MC performed best. If we set the limit N to five, we would retrieve a maximum of 25 facts (five $r_f$ for each of the five MC for a single MOI). Typically, the number of retrieved facts $f$ is below this limit because similar MC yield similar $r_f$. In addition, we found that considering replacement patterns with a likelihood of 0% (i.e., the rendered version of this macro never appears in the DLMF) harms performance drastically. This is because semantic macros without any arguments regularly match single letters, for example, $\Gamma$ representing the gamma function with the argument $(z)$

<sup>26</sup>https://en.wikipedia.org/w/index.php?diff=991994767&oldid=991251002&title=Bessel_function&type=revision [accessed 2021-06-23]

Table 5.4: Performance of description extractions via MLP for low (5.4a) and high (5.4b) relevance. In all tables, **D** refers to the depth (following in-going dependencies) in the dependency graph, **N** is the maximum number of facts and $r_f$ for the same MOI, TP are true positives, and FP are false positives.


being omitted. Hence, we decided to consider only replacement patterns that exist in the DLMF, i.e., $s_{\mathrm{DLMF}}(r_f) > 0$.

Since certain subtrees $\tilde{e} \subseteq e \in L_P$ can already be operator trees, i.e., $\tilde{e} \in L_C$, we calculate a baseline (base) that does not perform any transformations, i.e., $e = t(e, X)$. The baseline achieves a success rate of 16%. To estimate the impact of our manually defined set of common knowledge facts $K$, we also evaluated the transformations for $X = K$ and achieved a success rate of 29%, which is already significantly better than the baseline. The full pipeline, as described above, achieves a success rate of 48%. Table 5.5 compares the performance. The table shows that depth 1 outperforms depth 0, which intuitively contradicts the F1 scores in Table 5.4a. This underlines the necessity of the dependency graph. We further observe a drop in the success rate for larger N. This is attributable to the fact that $g_f(e)$ is not commutative and large N retrieve too many false positive facts $f$ with high ranks. We reach the best success rate for depth 1 and N = 6. Increasing the depth further has only a marginal impact because, at depth 2, most expressions are already single identifiers, which do not provide significant information for the translation process.
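The cap on retrieved facts described above (at most N mathematical concepts per MOI, and N rules per MC) can be sketched as follows; the names and data are illustrative, not LACAST's API:

```python
# With limit N we keep at most N top-ranked mathematical concepts (MC)
# per MOI and at most N replacement rules r_f per MC, i.e., at most
# N*N facts for a single MOI.
def top_facts(ranked_mcs, rules_per_mc, n):
    facts = []
    for mc in ranked_mcs[:n]:                       # top-N MCs for the MOI
        facts.extend(rules_per_mc.get(mc, [])[:n])  # top-N rules per MC
    return facts

rules = {
    "Jacobi polynomial": ["JacobiP-rule-1", "JacobiP-rule-2", "JacobiP-rule-3"],
    "Legendre function": ["LegendreP-rule-1"],
    "hypergeometric function": ["2F1-rule-1"],
}
# With N = 2, at most 4 facts are retrieved; here only 3 result,
# because the second MC provides a single rule.
print(top_facts(["Jacobi polynomial", "Legendre function",
                 "hypergeometric function"], rules, 2))
```

This also illustrates why the retrieved number typically stays below the theoretical maximum of N²: similar concepts share rules, and some concepts contribute fewer than N.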

### **5.2.3.3 Translations from LATEX to CAS**

Mathematica's ability to import TEX expressions serves as our baseline. While Mathematica does not allow entering a textual context, it does recognize structural information in the expression. For example, the Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ is correctly imported as JacobiP[n,\[Alpha],\[Beta],x] because no other supported function in Mathematica is linked with this presentation. Table 5.6 compares the performance. The methods base, ck, and full are the same as in Table 5.5 but now refer to translations to Mathematica rather than to semantic LATEX. Method full uses the optimal setting as shown in Table 5.5. We consider a


Table 5.5: Performance of semantification from LATEX to semantic LATEX. **D** refers to the depth (following in-going dependencies) in the dependency graph, **N** is the maximum number of facts and $r_f$ for the same MOI. The method base refers to no transformations, i.e., $t(e, X) = e$; ck uses $X = K$; and full uses the full proposed pipeline. ✔ matches the benchmark entry and ✘ does not match the entry.


translation a *match* (✔) if the value returned by Mathematica equals the value returned by the benchmark. The internal process of Mathematica ensures that the translation is normalized.

We observe that even without further improvements, LACAST already outperforms Mathematica's internal import function. Activating the general replacement rules further improves performance. Our full context-aware pipeline achieves the best results. The relatively high ratio of invalid translations for full is owed to the fact that semantic macros without an appropriate translation to Mathematica result in an error during the translation process. These errors ensure that LACAST only performs translations for semantic LATEX if a translation is unambiguous and possible for the contained functions [13]. Note that we were not able to appropriately translate 18 expressions (indicated by the human performance in Table 5.6), as discussed before.

### **5.2.4 Error Analysis & Discussion**

In this section, we briefly summarize the main causes of errors in our translation pipeline. A more extensive analysis can be found in Appendix F.3 (available in the electronic supplementary material) and on our demo page at https://tpami.wmflabs.org. In the following, we may refer to specific benchmark entries by their associated ID. Since the benchmark contains randomly picked formulae from the articles, it also contains entries that might not have been properly annotated with math templates or math-tags in the Wikitext. Four entries in the benchmark (28, 43, 78, and 85) were wrongly detected by our engine and contained only parts of the entire formula. In the benchmark, we manually corrected these entries. Aside from the wrong identification, we identified other reasons for a failed translation to semantic LATEX or CAS. In the following, we discuss the main reasons and possible solutions to avoid them, in order of their impact on translation performance.

Table 5.6: Performance comparison for translating LATEX to Mathematica. A translation was successful (**ST**) if it was syntactically verified by Mathematica (otherwise: **FT**). ✔ refers to matches with the benchmark and ✘ to mismatches. The methods are explained in Section 5.2.3.3.



### **5.2.4.1 Defining Equations**

Recognizing an equation as a definition would have a great impact on performance. As a test, we manually annotated every definition in the benchmark by replacing the equal sign = with the unambiguous notation := and extended LACAST to recognize such a combination as a definition of the left-hand side<sup>27</sup>. This resulted in 18 more correct translations (e.g., 66, 68, and 75) and increased the performance from .28 to .47. The accuracy for this manual improvement is given as Theory\_def in Table 5.6.
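The := recognition described above can be sketched with a simple pattern match; the regex and example below are illustrative, not LACAST's actual grammar:

```python
import re

# Minimal sketch: treat "lhs := rhs" as a definition of the left-hand
# side (the notation we introduced in the benchmark), while a plain "="
# remains an ordinary equation.
DEF_PATTERN = re.compile(r"^\s*(?P<lhs>[^:=]+?)\s*:=\s*(?P<rhs>.+)$")

def recognize_definition(equation):
    m = DEF_PATTERN.match(equation)
    return (m.group("lhs"), m.group("rhs")) if m else None

assert recognize_definition(r"\operatorname{Ein}(z) := \int_0^z \frac{1-e^{-t}}{t}\,dt") == \
    (r"\operatorname{Ein}(z)", r"\int_0^z \frac{1-e^{-t}}{t}\,dt")
assert recognize_definition(r"a = b + c") is None
```

The hard research problem, of course, is recognizing definitions that use a plain equal sign, which is what the dependency-graph approach below targets.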

The dependency graph may provide beneficial information towards a definition recognition system for equations. However, rather than assuming that every equation symbol indicates a definition [214], we propose a more selective approach. Considering one part of an equation (including multi-equations) as an extra MOI would establish additional dependencies in the dependency graph, such as a connection between $x = \operatorname{sn}(u, k)$ and $F(x; k) = u$. A combination with recent advances in definition recognition in NLP [111, 134, 183, 370] may then allow us to detect $x$ as the defining element. The already established dependency between $x$ and $F(x; k) = u$ can finally be used to resolve the substitution. Hence, for future research, we will elaborate on the possibility of integrating existing NLP techniques for definition recognition [111, 134] into our dependency graph concept.

### **5.2.4.2 Missing Information**

Another problem that causes translations to fail is missing facts. For example, the gamma function seems to be considered common knowledge in most articles on OPSF because it is often not specifically declared by name in the context (e.g., 19 or 31). To test the impact of considering the gamma function as common knowledge, we added a rule $r_f$ to $K$ and attached a low rank to it. The low rank ensures that the pattern for the gamma function will be applied late in the list of transformations. This indeed improved performance slightly, enabling a successful translation of three more benchmark entries (Theory\_ck in Table 5.6). This naive

<sup>27</sup>The DLMF did not use this notation, hence LACAST was not capable of translating := in the first place.

approach emphasizes the importance of domain knowledge for specific articles. In combination with article classifications [320], we could activate different common knowledge sets depending on the specific domain.

### **5.2.4.3 Non-Matching Replacement Patterns**

An issue we would face more regularly in domains other than OPSF is non-standard notation. As previously mentioned, without definition detection, we are not able to derive transformation rules if the MOI is not given in a standard notation, such as $p(a, b, n, z)$ for the Jacobi polynomial. This already happens for slight changes that are not covered by the DLMF. For six entries, for instance, we were unable to appropriately replace hypergeometric functions because they used the matrix and array environments in their arguments, while the DLMF (as shown in Table 4.5) only uses \atop for the same visualization. Consequently, none of our replacement patterns matched even though we correctly identified the expressions as hypergeometric functions. A possible solution to this kind of minor representational change might be to add more presentational variants for a semantic macro. Previously [14], we presented a search engine for MOI that allows searching for common notations for a given textual query. Searching for Jacobi polynomials in arXiv.org shows that different variants of $P_n^{(\alpha,\beta)}(x)$ are highly related or even used equivalently, such as $p$, $H$, or $R$ rather than $P$. There were also a couple of other minor issues we identified during the evaluation, such as synonyms for function names, derivative notations, or non-existent translations for semantic macros. This is also one of the reasons why our semantic LATEX test performed better than the translations to Mathematica. We provide more information on these cases on our demo page.

Implementing the aforementioned improvements would increase the score from .26 (26 out of 95) to .495 (47 out of 95) for translations from LATEX to Mathematica. We achieved these results based on several heuristics, such as the primary identifier rules or the general replacement patterns, which indicates that we may improve results even further with ML algorithms. However, the lack of a properly annotated dataset and of appropriate error functions made it difficult to achieve promising results with ML on mathematical translation tasks in the past [1, 15]. Our translation pipeline based on LACAST paves the way towards a baseline that can be used to train ML models in the future. Hence, we will focus on a hybrid approach of rule-based translations via LACAST on the one hand and ML-based information extraction on the other to further push the limits of our translation pipeline.

### **5.2.5 Conclude Qualitative Evaluations on Wikipedia**

We presented LACAST, the first context-sensitive translation pipeline for mathematical expressions to the syntax of the two major Computer Algebra Systems (CAS) Maple and Mathematica. We demonstrated that the information we need for a translation is given as noun phrases in the textual context surrounding a mathematical formula and in common knowledge databases that define notation conventions. We successfully extracted the crucial noun phrases via part-of-speech tagging. Further, we have shown that CAS can automatically verify the translated expressions by performing symbolic and numeric computations. In an evaluation with 104 Wikipedia articles in the domain of orthogonal polynomials and special functions, we verified 358 formulae using our approach. We identified one malicious edit with this technique, which was reverted by the community three days later. We have shown that LACAST correctly translates about 27% of mathematical formulae, compared to 9% with existing approaches and an 81% human baseline.

**139**

Further, we demonstrated a potential successful translation rate of 46% if LACAST can identify definitions correctly and 49% with a more comprehensive common knowledge database.

Our translation pipeline has several practical applications for a knowledge database like Wikipedia, such as improving readability [17] and user experience [150], enabling entity linking for mathematics [320, 17], or allowing for automatic quality checks via CAS [2, 11]. In turn, we plan to integrate [401] our evaluation engine into the existing ORES system to classify changes in complex mathematical equations as potentially damaging or good faith. In addition, the system provides access to different semantic formats of a formula, such as multiple CAS syntaxes and semantic LATEX [260]. As shown in the DLMF [260], the semantic encoding of a formula can significantly improve search results for mathematical expressions. Hence, we also plan to add the semantic information from our mathematical dependency graph to Wikipedia's math formulae to improve search results [17].

In future work, we aim to mitigate the issues outlined in Section 5.2.4, primarily focusing our efforts on definition recognition for mathematical equations. Advances on this matter will enable support for translations beyond OPSF. In particular, we plan to analyze the effectiveness of associating equations with the classification of their nearby context [111, 134, 183, 370], assuming a defining equation is usually embedded in a definition context. Apart from expanding the support beyond OPSF, we further focus on improving the verification accuracy of the symbolic and numeric evaluation pipeline. In contrast to the evaluations on the DLMF, our evaluation pipeline currently disregards constraints in Wikipedia. While most constraints in the DLMF directly annotate specific equations, Wikipedia contains constraints in the surrounding context of a formula. We plan to identify constraints with new pattern matches and distance metrics, assuming that constraints are often short equations (and relations) or set definitions and appear shortly after or before the formula they apply to. While we made math in Wikipedia computable, the encyclopedia does not take advantage of this new feature yet. In future work, we will develop an AI [401] (as an extension to the existing ORES system) that makes use of this novel capability.
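The constraint-spotting idea can be sketched as a pattern match over the textual context; the regex below is a hypothetical illustration, not the planned implementation:

```python
import re

# Hypothetical sketch of context-based constraint spotting: constraints
# are often short relations such as "x > 0" or "|q| < 1" that appear
# shortly before or after the formula they apply to.
RELATION = re.compile(r"[\w|\\{}^()]+\s*(?:<|>|\\leq?|\\geq?|\\neq?)\s*[\w|\\{}^()]+")

def nearby_constraints(context_window):
    return RELATION.findall(context_window)

print(nearby_constraints(r"The series converges for |q| < 1 and x > 0."))
# ['|q| < 1', 'x > 0']
```

A distance metric, as mentioned above, would then rank candidates by how close they appear to the formula in the article text.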


Sam Diamond - *Murder by Death*

### **CHAPTER 6**

### **Conclusion and Future Work**

### **Contents**


This chapter summarizes and concludes the contributions of this thesis in Sections 6.1 and 6.2, respectively. Section 6.3 provides an overview of future work projects.

### **6.1 Summary**

In this thesis, we presented novel approaches to translate presentational mathematical encodings into computable formats and to evaluate these translations. We focused on LATEX as the presentational encoding and on Computer Algebra System (CAS) syntaxes as computable formats, primarily targeting translations to the two major general-purpose CAS Maple and Mathematica.

Every mathematical format serves a specific purpose and encodes different amounts of semantic information into an expression. A presentational format encodes visual information, while computable formats need to uniquely link elements with specific definitions (i.e., implementations). There are numerous mathematical formats and conversion tools available. Many roads lead to Rome; thus, there are several translation paths from LATEX to CAS syntaxes available, including direct translations via CAS import functions (see Table 1.3). The most well-covered conversion path between mathematical formats is between the standard encodings LATEX and MathML. Since content MathML explicitly encodes semantic information and many CAS are able to import content MathML, the easiest approach for translating LATEX to CAS was to use MathML as an intermediate format. Hence, we developed MathMLben, a MathML benchmark, to evaluate the quality of the translations of several state-of-the-art LATEX to MathML conversion tools.

**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-40473-4\_6.


Our benchmark test revealed that existing LATEX conversion tools only consider the semantic information that is explicitly encoded in the given expression, e.g., via visual pattern recognition approaches. For example, Mathematica concludes $P_n^{(\alpha,\beta)}(x)$ to be the Jacobi polynomial because there is no other expression with the same pattern available in Mathematica. Only three of the nine state-of-the-art converters supported content MathML, and those with insufficient accuracy. The conversion tool LATExml performed best and is able to translate semantically enriched formulae in semantic LATEX. Without manual annotation with semantic macros, however, LATExml also creates wrong and incomplete results. In addition, even though CAS often support MathML (including content MathML), there is no public mapping available between functions in a Content Dictionary (CD) and functions in the CAS. Hence, a reliable import of MathML is generally limited to K-14<sup>1</sup> mathematics.

Prior to this thesis, we developed LACAST, a translator from semantic LATEX to the CAS Maple. LACAST was the first translator to a CAS syntax that provided additional information about the translation process and offered alternative translations if a direct mapping was unavailable. The first version of LACAST laid the foundation to solve translation issues related to differences in the definitions of functions, e.g., branch cut positioning. However, LACAST required manually crafted semantic LATEX as it is used in the DLMF. Subsequently, we focused on extending LACAST to perform a semantification step from LATEX to semantic LATEX based on the information gathered in the surrounding context of a formula.

The semantification of mathematical expressions, even though related to other MathIR tasks, was new due to the information needs of a translation to computable formats. Other tasks in MathIR, such as the search for relevant or similar formulae, rarely need to understand the structure of mathematical objects in an expression. For a translation to computable formats, a conversion tool needs to identify the subexpressions representing a specific formula, determine which formula it represents, which parts of the subexpression are variable or fixed (the stem), and how the formula is declared in the context. Existing approaches to semantically enhance mathematical expressions with information from a textual context can be categorized into two groups. The first group takes single identifiers (or other single tokens) and attaches information from the context to these identifiers. The second group annotates entire mathematical expressions. Both approaches, however, ignore informative and crucial subexpressions.

As a first approach for a semantification process, we explored the capabilities of word embedding techniques. These models generally perform well on several natural language processing tasks and are able to capture co-occurrences of tokens in large corpora. These co-occurrences seem to model semantic relationships, as is often shown with the famous king-queen relationship<sup>2</sup>. Unfortunately, we were unable to achieve similar results for math embeddings due to fundamental issues in existing embedding approaches. While natural language sentences are a sequential order of words, math formulae are deeply nested structures in which only a few tokens are fixed. However, distinguishing fixed from variable tokens, i.e., identifying the *stem* of a mathematical function, is context-dependent. In order to overcome these representational issues of mathematical expressions, we introduced a new nested concept for mathematical expressions, MOI.

<sup>1</sup> Kindergarten to early college.

<sup>2</sup> The relationship between *king* and *man* is very similar (in terms of cosine difference between the vector representations) to the relationship between *queen* and *woman*.
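The king-queen relationship from footnote 2 can be sketched with toy vectors. The vocabulary and vector values below are invented for illustration; real embeddings are trained on large corpora:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vocab):
    """Solve 'a - b + c ~ ?' by nearest cosine neighbor, e.g.,
    king - man + woman ~ queen."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

vocab = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 1.0],
    "gamma": [0.5, 0.5, 0.5],
}
```

For math tokens, no comparably stable co-occurrence statistics exist, which is why this analogy trick did not carry over to math embeddings.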

Figure 6.1: Layers of a mathematical expression with mathematical objects (MOI). MOI in the function layer can be semantically enhanced by semantic LATEX macros. The red tokens are fixed tokens of the MOI and the gray tokens are variable (variables and parameters).

A Mathematical Object of Interest (MOI) represents a meaningful mathematical subexpression (math object) which might be composed of other MOI. Figure 6.1 shows different layers of mathematical objects within the defining formula of Jacobi polynomials. As previously mentioned, most MathIR approaches focus on the context-independent elements in the expression or identifier layer. For translating equations from LATEX to CAS syntaxes, however, the elements in the layers between both extremes are generally most crucial. If we want to translate an equation to the syntax of a CAS, we need to primarily translate MOI in the function layer because those elements are mapped to unique keywords in the CAS. As an approach to explore the usability of the new MOI concept, we performed the first large-scale notation study of over 2.5 billion mathematical subexpressions in 2 million documents from arXiv and zbMATH. We have shown that the distribution of mathematical subexpressions is similar to that of words in natural language corpora. Following the idea that mathematical expressions are more comparable to sentences in natural languages, we analyzed the effectiveness of distribution scores, such as BM25, to retrieve MOI for given textual descriptions and achieved good results.
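The retrieval of MOI via distribution scores can be sketched with a minimal BM25 implementation. Treating each MOI's attached textual descriptions as a "document" is our illustrative framing, not the exact setup of the study:

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Minimal BM25 sketch: score each MOI 'document' (a list of
    description terms) against a textual query."""
    N = len(documents)
    avgdl = sum(len(d) for d in documents) / N
    df = Counter()                      # document frequency per term
    for d in documents:
        df.update(set(d))
    scores = []
    for d in documents:
        tf = Counter(d)                 # term frequency in this document
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Ranking the MOI by these scores for a query such as "jacobi polynomial" then yields the candidates whose descriptions best match the query.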

Consequently, we developed a novel semantification pipeline based on the MOI concept in which we presume that every isolated mathematical expression in a text is meaningful. The connections between MOI are modeled by a mathematical dependency graph that links two MOI if one is a subexpression of the other (following a specific heuristic to allow matches between Γ(*x*) and Γ(*z*)). Each MOI (now a node in the dependency graph) is tagged with descriptions extracted from the textual context. With these descriptions, we can retrieve semantic LATEX macros that represent the MOI. In addition, the dependency graph allows retrieving semantic LATEX macros for each meaningful subexpression too. Finally, we semantically enhance the original LATEX expression by replacing each MOI with the corresponding semantic LATEX macro. The resulting enhanced expression can be further translated to CAS syntaxes with LACAST. Figure 6.2 shows the relevant annotations and dependencies of the defining formula of Jacobi polynomials in the English Wikipedia article. In order to replace LATEX with semantic LATEX macros, we retrieve all textual descriptions (green boxes) surrounding the formula and all dependent MOI (blue boxes).


Figure 6.2: The annotated defining formula of Jacobi polynomials (yellow) in the English Wikipedia article. The defining formula depends on two other MOI (blue) in the same article: $P_n^{(\alpha,\beta)}(x)$ and $(\alpha+1)_n$. Hence, in order to properly translate the defining formula, we need to translate the dependent MOI. This can be achieved by retrieving textual annotations (green) from the surrounding context.

The proposed semantification approach requires a semantic LATEX macro to semantically enhance an MOI. The semantic macros were developed for the DLMF and mostly cover OPSF. General-purpose CAS, like Maple and Mathematica, generally support functions from this area natively. Hence, there is a significant overlap between the functions that have a semantic macro in the DLMF and those that are natively supported by CAS. Translating general expressions to CAS is often not possible and may require entirely new subroutines in the CAS. Consider the prime counting function *π*(*x*), which does not exist in Maple. In this case, translating *π*(*x*) to Maple is impossible unless we are able to automatically generate subroutines that compute this function. Often, however, general functions are much simpler and may be represented by known functions, e.g., $f(x) := \sin^2(x)$. In this case, we need to identify the definition of *f*(*x*) in order to properly translate it. Translating *f*(*x*) − *g*(*x*), for instance, is meaningless without knowing the definitions of *f*(*x*) and *g*(*x*). However, determining whether an equation declares a definition remains an open research task for future work.
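The role of definitions can be illustrated with SymPy. The definition store below is a hypothetical stand-in for context-extracted definitions, not part of LACAST:

```python
import sympy as sp

# Once the (hypothetical) context-extracted definitions of f and g are
# known, the otherwise meaningless difference f(x) - g(x) becomes computable.
x = sp.symbols('x')
f, g = sp.Function('f'), sp.Function('g')

definitions = {f(x): sp.sin(x)**2, g(x): 1 - sp.cos(x)**2}

resolved = (f(x) - g(x)).subs(definitions)
print(sp.simplify(resolved))  # both definitions describe the same function
```

Without the `definitions` lookup, no CAS can do anything sensible with `f(x) - g(x)`; with it, the difference simplifies away entirely.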

As an alternative to the new context-sensitive translation pipeline for LACAST, we also experimented with machine translation approaches for LATEX to CAS conversions. We discovered that our machine translation approach is very powerful in adapting conversion rules of other converters, e.g., the LATEX export function of Mathematica or the conversion process by LATExml. Here, we achieved up to 95.1% exact match accuracy for undoing an export conversion by Mathematica and 90.7% accuracy for undoing a conversion by LATExml. However, we also identified that such machine translations are very unreliable when it comes to general mathematical expressions. On 100 randomly selected samples from the DLMF, our machine translation approach correctly translated only 5% of the expressions, compared to 11% by Mathematica and 7% by SymPy. Our rule-based translator LACAST achieved 22%. If LACAST performs translations on the original semantic LATEX source of the 100 samples from the DLMF, LACAST achieves 51% accuracy. On non-semantically enhanced cases from Wikipedia articles, our new context-sensitive version of LACAST correctly translated 27% compared to the state-of-the-art 9% by Mathematica. We have also shown that a proper definition detection system and an improved common knowledge dataset would boost the number of correctly translated expressions to 47%. In comparison, a human annotator was able to translate 81% of the expressions manually.

For determining whether a translation was correct, one cannot directly adapt established measures for natural language translations. The well-known BLEU score, for instance, is inappropriate since two entirely different mathematical expressions can still be equivalent. Hence, we developed a novel evaluation system based on the fact that a translated expression can be further computed by the CAS. Consider an equation that mathematicians manually proved, such as

$$
\sin^2(z) + \cos^2(z) = 1.\tag{6.1}
$$

If the translation of this expression was correct, the equation must be valid in the syntax of the CAS too. Most CAS are powerful enough to verify such a simple equivalence, e.g., via symbolic simplifications. In combination with a comprehensive library of proven equations, such as the DLMF, we could semantically evaluate translations by LACAST.

There is a catch to this evaluation technique. Verifying an equation to be correct can become arbitrarily complex (consider the famous Riemann hypothesis or Fermat's last theorem, for example). Hence, automatically verifying an equation with CAS is limited. Nonetheless, CAS are powerful and flexible tools, especially when it comes to numeric evaluations. We developed a two-step evaluation approach to verify an equation in CAS. First, we symbolically simplify the difference of the left- and right-hand sides of an equation to zero. If the result is zero, the equation is considered symbolically verified. Second, if the symbolic verification failed, we numerically calculate the difference between the left- and right-hand sides for actual numeric test values. An equation is numerically verified if the difference is close to zero (up to machine accuracy) for all test values. While the numeric evaluation approach never proves equivalence, it can detect disparity. A symbolically or numerically verified equation can be considered as correctly translated by LACAST.
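The two-step approach can be sketched in SymPy. This mirrors the described pipeline in spirit only; the function name, the choice of test values, and the tolerance are illustrative assumptions, not LACAST's actual configuration:

```python
import sympy as sp

def verify_equation(lhs, rhs, variables, n_tests=10, tol=1e-10):
    """Step 1: try to symbolically simplify lhs - rhs to zero.
    Step 2: if that fails, compare both sides on numeric test values."""
    difference = sp.simplify(lhs - rhs)
    if difference == 0:
        return "symbolically verified"
    # Numeric fallback: sample complex test values and check |lhs - rhs| ~ 0.
    for k in range(1, n_tests + 1):
        subs = {v: sp.Rational(k, n_tests) + sp.I / (k + 1) for v in variables}
        value = complex(difference.subs(subs).evalf())
        if abs(value) > tol:
            return "numerically falsified"
    return "numerically verified"

z = sp.symbols('z')
print(verify_equation(sp.sin(z)**2 + sp.cos(z)**2, sp.Integer(1), [z]))
```

For equation (6.1) the symbolic step already succeeds; a deliberately wrong equation such as sin(*z*) = cos(*z*) fails on the very first numeric test value.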

It turns out that the translations of LACAST are so reliable on DLMF equations that this evaluation technique not only detects issues in the translation process but in the source and target systems as well. Consider there is an error in a test equation, such as in

$$\mathbf{Q}\_{\nu}^{-1/2}(\cos \theta) = -\left(\frac{\pi}{2\sin \theta}\right)^{1/2} \frac{\cos\left(\left(\nu + \frac{1}{2}\right)\theta\right)}{\nu + \frac{1}{2}}.\tag{6.2}$$

The numeric evaluation would fail for most test values, indicating that there was an error either in the source equation, i.e., the DLMF, in the translator LACAST, or in the target CAS. Hence, we evaluated the entire DLMF with this evaluation technique and identified numerous issues in the DLMF, Wikipedia, Maple, and Mathematica. Via LACAST translations and evaluations, for example, we identified the sign error (the red marked minus) in equation (6.2) in the DLMF [98, (14.5.14)]. This error was fixed with version 1.0.16 of the DLMF. The most notable error reports include this sign error and incorrect semantic annotations in the DLMF, wrong calculations for specific integrals and bugs in a variable extraction algorithm in Mathematica, incorrect symbolic computations in Maple, and malicious edits in Wikipedia articles<sup>3</sup>.


<sup>3</sup> An overview of discovered, reported, and fixed issues in CAS, the DLMF, and Wikipedia articles is available in Appendix D in the electronic supplementary material.

Note that, even with our novel semantification approach, LACAST cannot be considered a finished project (see Section 6.3). Several improvements could be achieved in the future. A crucial issue occurs, for instance, if a function does not follow the DLMF standard notation, e.g., *p*(*n*, *α*, *β*, *x*) for the Jacobi polynomial rather than $P_n^{(\alpha,\beta)}(x)$. In that case, LACAST is incapable of translating the expression. There is, however, no easy solution to this problem. Such a custom notation raises the question of the order of the arguments. For example, in *p*(*a*, *b*, *c*, *d*), we cannot determine whether *c* refers to the degree of the Jacobi polynomial and should be mapped to the first argument in Mathematica syntax, or to any other position. One possible workaround is to fetch and analyze the definition of *p*(*a*, *b*, *c*, *d*), provided the definition is available in the context. By comparing the definition in the context with the actual Jacobi polynomial definition in the DLMF or the CAS, we could map each argument to its respective semantics, e.g., *c* to the *degree* of the polynomial. Such a comparison would introduce its own challenges. For example, what if the definition is not exactly the same as in the DLMF? Moreover, as we pointed out earlier, determining whether an equation is a defining formula is also an open research question. Recently, a similar issue gained interest in the NLP community with the goal of determining the semantic classification of paragraphs and text spans, such as definitions, theorems, or examples [111, 134, 183, 209, 370]. Most of the remaining issues of LACAST come along with open research questions. Some examples are:


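The argument-mapping workaround discussed above can be illustrated in SymPy by brute-forcing the parameter order. All names are hypothetical; structural equality stands in for a full symbolic comparison:

```python
import itertools
import sympy as sp

def infer_argument_order(context_args, context_rhs, reference_params, reference_rhs):
    """Try every mapping of the custom arguments onto the reference
    parameters and keep the one under which both definitions coincide."""
    for perm in itertools.permutations(reference_params):
        mapping = dict(zip(context_args, perm))
        candidate = context_rhs.subs(mapping, simultaneous=True)
        # A robust version would compare via sp.simplify(candidate - reference_rhs);
        # structural equality suffices for this toy example.
        if candidate == reference_rhs:
            return mapping
    return None

n, a_, b_, x_ = sp.symbols('n alpha beta x')
A, B, C, D = sp.symbols('A B C D')
reference = sp.jacobi(n, a_, b_, x_)    # P_n^{(alpha,beta)}(x)
context_def = sp.jacobi(B, C, D, A)     # p(A, B, C, D) with shuffled arguments
print(infer_argument_order((A, B, C, D), context_def, (n, a_, b_, x_), reference))
```

In practice the context definition would not literally reuse the reference function, so the comparison itself, not the permutation search, is the hard part.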
Nonetheless, LACAST, in its current state, already outperforms existing presentational-to-computational translation solutions, improves the scientific work cycle of experimenting and publishing, and even helps to correct issues in DML and CAS. LACAST increases the trustworthiness of translations with transparent communication about the translation decisions [13]. In combination with direct access to CAS kernels, LACAST also performs automatic verification checks on its translations, the source formula, and the system computations. This capability was successfully demonstrated on the DLMF, in which we were able to identify numerous issues, from missing or incorrect semantic annotations to wrong constraints and sign errors [2]. With the same evaluation approach, LACAST helped discover bugs in the commercial CAS Maple and Mathematica [8]. In Wikipedia, LACAST computations allow for detecting malicious edits, and the performed semantic enhancements potentially improve the readability and accessibility of mathematical content [11].

In addition, several of the projects on the way to the final version of LACAST contributed to multiple MathIR tasks. The developed MathML benchmark MathMLben, for instance, is used for research in mathematical entity linking [321]. Our math embedding experiments enabled new approaches, such as centroid search queries and similarity measures for mathematical expressions [15, 323, 332, 404]. Our study of the frequency distributions of mathematical subexpressions in large corpora [14] enabled a new search engine for zbMATH [16], an autocompletion for mathematical inputs, new approaches for plagiarism detection systems<sup>4</sup>, and literature recommendation systems that will, for the first time, take mathematical content into account [50]. The mathematical dependency graph generated by LACAST can be embedded in Wikipedia to provide additional semantic information about a formula in a pop-up information window [17]. Lastly, LACAST is currently planned to be integrated into future versions of the DLMF to provide static translations for all DLMF equations and a live interface for general expressions. The source of LACAST has been publicly available at https://github.com/gipplab/LaCASt since February 2022.

**LACAST Translation Examples** To conclude with the examples from the introduction of the thesis, LACAST correctly translates every expression in Table 1.2 to Maple, Mathematica, and SymPy. On 100 randomly selected formulae from the DLMF, LACAST correctly translated 22% and significantly outperforms existing converters, such as Mathematica (11%), SymPy (7%), and machine translations (5%). From the semantic LATEX source, LACAST correctly translated 51% of the 100 samples. LACAST addresses the issues of branch cuts and differences in definitions between the systems by providing additional information and a transparent decision process. For instance, arccot(*z*) is translated to Maple as arccot(z), but LACAST warns about the differences in the positioning of branch cuts and informs the user about alternative translation patterns, such as I/2\*ln((\$0-I)/(\$0+I)) or arctan(1/(\$0)). Additionally, LACAST provides links to the definitions of the function, the domains, and the constraints, if available. By providing a textual context that declares $P_n^{(\alpha,\beta)}(x)$ as the Jacobi polynomial and Γ(*z*) as the Gamma function, LACAST also correctly translates equation (1.1) from the introduction. Neither CAS import functions nor alternative translations via MathML (followed by an import to the CAS) are capable of correctly translating equation (1.1), all expressions in Table 1.2, or *π*(*x* + *y*) in various contexts. Further, no system besides LACAST informs the user about potential issues, such as the different branch cuts of arccot(*z*).
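The placeholder-based translation patterns shown above ($0, $1, ...) can be sketched as plain string substitution. The helper `apply_pattern` is a hypothetical illustration, not LACAST's API:

```python
def apply_pattern(pattern, arguments):
    """Replace the placeholders $0, $1, ... in a translation pattern
    with the (already translated) arguments of the function."""
    result = pattern
    for i, arg in enumerate(arguments):
        result = result.replace(f"${i}", arg)
    return result

# arccot(z): direct Maple translation vs. the logarithmic alternative pattern
direct = apply_pattern("arccot($0)", ["z"])
alternative = apply_pattern("I/2*ln(($0-I)/($0+I))", ["z"])
```

Offering both patterns side by side is what lets a user pick the variant whose branch cut behaviour matches the source convention.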

To provide a more sophisticated example that underlines the capabilities of LACAST, consider Bailey's transformation of very-well-poised <sup>8</sup>*φ*<sup>7</sup> from the DLMF [98, (17.9.16)]

$$\begin{split} \,\_8\phi\_7\left( \begin{array}{c} a, qa^{\frac{1}{2}}, -qa^{\frac{1}{2}}, b, c, d, e, f \\ a^{\frac{1}{2}}, -a^{\frac{1}{2}}, aq/b, aq/c, aq/d, aq/e, aq/f \end{array}; q, \frac{a^2q^2}{bcdef} \right) \\ = \frac{\left(aq, aq/(de), aq/(df), aq/(ef); q\right)\_\infty}{\left(aq/d, aq/e, aq/f, aq/(def); q\right)\_\infty} \,\_4\phi\_3\left( \begin{array}{c} aq/(bc), d, e, f \\ aq/b, aq/c, def/a \end{array}; q, q \right) \\ + \frac{\left(aq, aq/(bc), d, e, f, a^2q^2/(bdef), a^2q^2/(cdef); q\right)\_\infty}{\left(aq/b, aq/c, aq/d, aq/e, aq/f, a^2q^2/(bcdef), def/(aq); q\right)\_\infty} \\ \times \,\_4\phi\_3\left( \begin{array}{c} aq/(de), aq/(df), aq/(ef), a^2q^2/(bcdef) \\ a^2q^2/(bdef), a^2q^2/(cdef), aq^2/(def) \end{array}; q, q \right). \end{split} \tag{6.3}$$

No CAS nor any other translation approach is capable of interpreting and translating this expression correctly with (or without) semantic annotations or textual descriptions. Mathematica, for example, cannot interpret leading indices correctly, such as in <sup>8</sup>*φ*<sup>7</sup>, and is unable to understand (*a*, *b*; *q*)<sub>*n*</sub> because the multiple *q*-Pochhammer symbol does not exist in Mathematica.


<sup>4</sup> See the DFG (German Research Foundation) fund: *Analyzing Mathematics to Detect Disguised Academic Plagiarism* (https://gepris.dfg.de/gepris/projekt/437179652 [accessed 2021-09-08])


Since the DLMF source uses semantic macros to unambiguously describe the expression, LACAST translates this complicated equation from the DLMF to Mathematica effortlessly by exploiting the definition of the multiple *q*-Pochhammer symbol. Additionally, LACAST provides useful information about the internal decision process (see Figure 6.3). Outside of the DLMF, e.g., in Wikipedia, LACAST would require a context that explains the functions in equation (6.3) to properly disambiguate the components.

### **A short example context that enables LACAST to properly understand equation (6.3)**

The basic hypergeometric function <sup>2</sup>*φ*<sup>2</sup>(*a*, *b*; *c*, *d*; *q*, *z*) and the multiple *q*-Pochhammer symbol (*a*, *b*; *q*)<sub>*n*</sub> describe Bailey's transformation of very-well-poised <sup>8</sup>*φ*<sup>7</sup>.

In combination with this context, LACAST identifies the function patterns and semantically enhances the input expression with DLMF macros. Consequently, LACAST correctly translates the expression to Mathematica, as it did for the original DLMF source equation, and provides the same useful information about the translation decisions (see Figure 6.3). Unfortunately, the equation is too complex for our automatic evaluation approach.

Performing a manual translation for such significant expressions is very exhausting and requires a deep understanding of the CAS. Simple mistakes, such as a sign error or a switched order of arguments, can lead to errors that are very difficult to detect. Additionally, even translations to appropriate counterparts in the CAS can quickly lead to undesired behaviour (as we have seen for translations of arccot(−1)). By providing information about the internal translation decisions, LACAST translations are more trustworthy and comprehensible. LACAST notifies a user about potential issues regarding branch cut positions or questionable translation decisions, mitigating the chance of wrong, untraceable results. For instance, LACAST is aware of the issue that the *q*-multi-Pochhammer symbol is not natively supported by Mathematica and performs an alternative translation instead. Further, LACAST sensitizes users to potential ambiguity issues, such as the use of abbreviations<sup>5</sup> or the ambiguity<sup>6</sup> of *e*.

#### **Translation of Bailey's Transformation of Very-Well-Poised** <sup>8</sup>*φ*<sup>7</sup> **(see equation (6.3) and [98, (17.9.16)])**

```mathematica
QHypergeometricPFQ[{a, q*(a)^(Divide[1,2]), -q*(a)^(Divide[1,2]), b, c, d, e, f},
    {(a)^(Divide[1,2]), -(a)^(Divide[1,2]), a*q/b, a*q/c, a*q/d, a*q/e, a*q/f},
    q, Divide[(a)^(2)*(q)^(2), b*c*d*e*f]]
 == Divide[
      Product[QPochhammer[Part[{a*q, a*q/(d*e), a*q/(d*f), a*q/(e*f)}, i], q, Infinity],
        {i, 1, Length[{a*q, a*q/(d*e), a*q/(d*f), a*q/(e*f)}]}],
      Product[QPochhammer[Part[{a*q/d, a*q/e, a*q/f, a*q/(d*e*f)}, i], q, Infinity],
        {i, 1, Length[{a*q/d, a*q/e, a*q/f, a*q/(d*e*f)}]}]]
    * QHypergeometricPFQ[{a*q/(b*c), d, e, f}, {a*q/b, a*q/c, d*e*f/a}, q, q]
  + Divide[
      Product[QPochhammer[Part[{a*q, a*q/(b*c), d, e, f, (a)^(2)*(q)^(2)/(b*d*e*f),
          (a)^(2)*(q)^(2)/(c*d*e*f)}, i], q, Infinity],
        {i, 1, Length[{a*q, a*q/(b*c), d, e, f, (a)^(2)*(q)^(2)/(b*d*e*f),
          (a)^(2)*(q)^(2)/(c*d*e*f)}]}],
      Product[QPochhammer[Part[{a*q/b, a*q/c, a*q/d, a*q/e, a*q/f,
          (a)^(2)*(q)^(2)/(b*c*d*e*f), d*e*f/(a*q)}, i], q, Infinity],
        {i, 1, Length[{a*q/b, a*q/c, a*q/d, a*q/e, a*q/f,
          (a)^(2)*(q)^(2)/(b*c*d*e*f), d*e*f/(a*q)}]}]]
    * QHypergeometricPFQ[{a*q/(d*e), a*q/(d*f), a*q/(e*f), (a)^(2)*(q)^(2)/(b*c*d*e*f)},
        {(a)^(2)*(q)^(2)/(b*d*e*f), (a)^(2)*(q)^(2)/(c*d*e*f), a*(q)^(2)/(d*e*f)}, q, q]
```

Linebreaks are manually added to improve readability.

<sup>5</sup> An abbreviation may refer to a single variable. For instance, *def* may refer to a variable *definition* earlier in the article. However, an interpretation as three individual variables (i.e., *d*, *e*, and *f*) is often more reasonable.

<sup>6</sup> The letter *e* is commonly used for Euler's number but can also simply refer to a Latin letter variable.

### **Free Variables**

a, b, c, d, e, f, q

### **Abbreviation Warning**

Found a potential abbreviation: def. This program cannot translate abbreviations. Hence the expression was interpreted as a sequence of multiplications, e.g., etc -> e\*t\*c.

### **Math Constant** *e*

You used a typical letter for a constant (the mathematical constant *e*, known as *Napier's constant* with a value of 2*.*71828182845 *...*). We keep it like it is! But you should know that Mathematica uses E for this constant. If you want to translate it as the constant, use the corresponding DLMF macro \expe.

### **Translation Information for** *<sup>r</sup>φ<sup>s</sup>*

**Name:** Basic hypergeometric (or *q*-hypergeometric) function **Example:** \qgenhyperphi{r}{s}@@@{a\_1,...,a\_r}{b\_1,...,b\_s}{q}{z}

**Translation Pattern:** QHypergeometricPFQ[{\$2},{\$3},\$4,\$5]

**Relevant Links**

DLMF: http://dlmf.nist.gov/17.4#E1 Mathematica: https://reference.wolfram.com/language/ref/QHypergeometricPFQ.html

### **Translation Information for** (*x*; *<sup>q</sup>*)*<sup>n</sup>*

**Name:** *q*-Multi-Pochhammer symbol **Example:** \qmultiPochhammersym{a\_1,\ldots,a\_n}{q}{n}

Translation pattern unavailable. Use alternative translation pattern instead. **Alternative Translation Pattern:**

**Product**[**QPochhammer**[**Part**[{\$0},i],\$1,\$2],{i,1,Length[{\$0}]}]

**Relevant Links** DLMF: http://dlmf.nist.gov/17.2.E5 Mathematica: unavailable

Figure 6.3: Translation information about the translation of Bailey's transformation of very-well-poised <sup>8</sup>*φ*<sup>7</sup> to Mathematica of equation (6.3) with LACAST (see also the DLMF [98, (17.9.16)]). Since the *q*-Multi-Pochhammer symbol is not natively supported in Mathematica, LACAST uses the alternative translation pattern based on the definition of the function [98, (17.2.5)]. The information about abbreviations and names of constants is fetched from the POM tagger's lexicon files [402] that LACAST relies on.

### **6.2 Contributions and Impact of the Thesis**

This thesis made three main contributions:


These contributions resulted in 14 peer-reviewed publications [1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] with 2 doctoral program participations [4, 5] and 2 invited talks [6, 7]. The publications were cited 63 times<sup>7</sup> overall. In addition, 3,782 commits<sup>8</sup> to a variety of different open source projects were performed during the time of the thesis. In the following, we briefly summarize the contributions of this thesis for each of the five research tasks that were defined in the introduction, Section 1.3.

### **Research Task I**

Analyze the strengths and weaknesses of existing semantification approaches for translating mathematical expressions to computable formats.

*Contributing Publications:* [1, 9, 12, 18]

To analyze the strengths and weaknesses of existing translation tools, we performed a new evaluation of nine state-of-the-art LATEX to MathML converters, including Mathematica as a CAS. We developed a new benchmark for MathML, called MathMLben, to evaluate translations against a manually crafted gold standard dataset. All converters solely rely on the semantic information that can be retrieved from the structure of an expression, e.g., by pattern matching approaches. In addition, only three converters supported content MathML, with unsatisfactory accuracy.

The main identified weakness of all analyzed tools was the failure to take local contextual information into account for the translation process. Through our evaluation, we were able to significantly improve LATExml translations by manually annotating LATEX expressions with semantic information via semantic LATEX macros. This performance improvement underlines the need for a semantification process that automatically performs semantic annotations based on information from a given context. The poor accuracy of all evaluated conversion tools showed that translations from LATEX over MathML to CAS have no advantages compared to other translation paths, e.g., over semantic LATEX. Since semantic LATEX translations to Maple had already been successfully implemented with the first version of LACAST, and the accuracy of LATExml significantly improved with semantic macro annotations, we chose semantic LATEX as an intermediate format to translate expressions from LATEX to CAS syntaxes.

<sup>7</sup> According to Google Scholar evaluated on 2021-09-16.

<sup>8</sup> According to github.com evaluated on 2021-08-19.

### **Research Task II**

Develop a semantification process that improves on the weaknesses of current approaches. *Contributing Publications:* [10, 14, 15]

We accomplished this research task by developing a novel semantification process that relies on the textual information in the nearby context of a formula combined with a set of standard knowledge information. As a first attempt at creating a new common knowledge dataset, we studied math embeddings (i.e., word embeddings for mathematical expressions) to retrieve common co-occurrences between math objects and textual descriptions. This attempt was unsuccessful due to the flexible and nested nature of mathematical notation. Instead, we relied on the DLMF and the lexicon files of the POM tagger for our common knowledge database.

To analyze the nearby textual context, we retrieve noun phrases as descriptions for mathematical objects. Since the concept of mathematical objects was barely studied in the past, we introduced the new concept of so-called Mathematical Objects of Interest (MOI). The idea behind MOI is that every mathematical subexpression is potentially meaningful. Previous research efforts in the MathIR area focused either on single identifiers or on entire mathematical expressions, ignoring the interconnectivity between subexpressions in math formulae. The new MOI concept has proven successful on a variety of different tasks in MathIR. Consequently, we developed a novel semantification process based on MOI. The semantification process generates a mathematical dependency graph of MOI and annotates each MOI with textual descriptions from their textual context. The dependencies provide access to relevant descriptions of an MOI and its subexpressions (which are also MOI). With these descriptions, we retrieve semantic LATEX macros from the DLMF that replace the original LATEX subexpressions. This semantification gradually transforms the original LATEX expression into the semantically enhanced semantic LATEX encoding.

### **Research Task III**

Implement a system for the automated semantification of mathematical expressions in scientific documents. *Contributing Publications:* [11, 16, 17]

We achieved this research task by relying on the results of several previous research projects. The nearby textual analysis was performed with a modified version of the mathosphere system [279, 329, 330] which was initially designed to retrieve identifier-definiens pairs from a mathematical text. We updated the system to retrieve facts, i.e., pairs of MOI and textual descriptions, from a given text. We further generated the dependency graph of MOI in a document with the approaches outlined by Kristianto et al. [214]. Finally, we extended the POM tagger [402] to create tree patterns of semantic LATEX macros from the DLMF.

This new semantification pipeline is performed in four steps. First, we analyze a given text, e.g., a Wikipedia page, to identify all MOI and noun phrases. Second, we build a mathematical dependency graph by defining directed edges between MOI if an MOI is a subexpression of another MOI. Further, each MOI is annotated with noun phrases taken from the same sentences the MOI appears in (including subexpression appearances). Third, we use the noun phrases of an MOI and the noun phrases of dependent MOI to determine replacement patterns for semantic DLMF LATEX macros. This replaces generic LATEX subexpressions by semantic LATEX macros. Fourth, the resulting semantic LATEX expression is translated to the target CAS syntax by LACAST (see the next research task).
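The second step, building the dependency graph, can be sketched in a few lines. As a minimal illustration, the subexpression relation is approximated here by substring containment on LATEX strings; the actual system works on parse trees, and the example MOI are only illustrative.

```python
# Sketch of step 2: add a directed edge whenever one MOI is a
# subexpression of another. Substring containment stands in for the
# real subtree matching on parse trees.
from collections import defaultdict

def build_moi_graph(mois):
    """Map each MOI to the set of larger MOI that contain it."""
    edges = defaultdict(set)
    for small in mois:
        for large in mois:
            if small != large and small in large:
                edges[small].add(large)
    return edges

mois = ["P_n^{(\\alpha,\\beta)}(x)", "P_n^{(\\alpha,\\beta)}", "x", "n"]
graph = build_moi_graph(mois)
print(sorted(graph["x"]))  # the larger MOI that contain "x"
```

Following the edges from an MOI to its containing expressions (and back) is what gives each MOI access to the noun phrases collected for its subexpressions.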

For this research task, we also explored the capabilities of machine translation techniques. We discovered that our sequence-to-sequence model outperforms other machine translation models and achieves very good scores on undoing conversions of rule-based translators, such as Mathematica's LATEX export function and LATExml translations of semantic LATEX. However, we also show that our machine translation models are unreliable on other general mathematical expressions that have not been generated by Mathematica or LATExml. We conclude that our machine translation model in its current form is, therefore, unsuitable for performing LATEX to CAS translations.

### **Research Task IV**

Implement an extension of the system to provide translations to computer algebra systems. *Contributing Publications:* [3, 11, 13]

We accomplished research task **IV** with the previously developed translator LACAST. LACAST was originally implemented as a rule-based translator for semantic LATEX expressions in the DLMF and solely supported Maple as a target CAS. In this thesis, we extended LACAST to support more CAS, especially focusing our efforts on Mathematica and (more recently) on SymPy. Further, we implemented additional semantification heuristics in order to correctly translate the mathematical operators for integrals, sums, products, and limits. With a study of the prime notations (for derivatives) in the DLMF, we further expanded the coverage of LACAST translations, specifically for functions in the DLMF.

Lastly, we added the previously developed semantification pipeline to LACAST, which finally turns LACAST into the first context-sensitive LATEX to CAS translator. LACAST is currently able to parse the context of a given English Wikipedia article. Moreover, the pipeline allows analyzing any English text document that encodes mathematical formulae in LATEX.

### **Research Task V**

Evaluate the effectiveness of the developed semantification and translation system.

*Contributing Publications:* [2, 8, 11]

We accomplished research task **V** with a combination of a qualitative and a quantitative evaluation pipeline. For the qualitative evaluation of LACAST, we manually crafted a benchmark dataset of 95 equations from English Wikipedia articles about OPSF. LACAST was able to correctly transform LATEX into semantic LATEX for 48% of the equations and achieved 27% correct translations to Mathematica overall. In comparison, Mathematica's LATEX import function correctly imported 9% of the expressions, and a human annotator was able to translate 81% of the equations to Mathematica. We were able to show that a theoretical concept of definition detection and a domain-dependent common knowledge database (rather than a fixed common knowledge database) would increase the number of correct translations via LACAST to Mathematica from 27% to 49%. Performing translations from the semantic LATEX dataset DLMF underlines that the most pressing issue still remains a reliable semantification pipeline. LACAST was able to translate 62.9% and 72% of all DLMF equations to Maple and Mathematica, respectively. To evaluate the semantification, we further analyzed LACAST's ability to retrieve relevant descriptions from the context of a given formula and achieved an F1 score of .495 (.508 precision and .483 recall).

Further, we developed a new concept to verify a translated expression based on the assumption that a correct equation in the source database must remain valid after translating it to the target system. The computational ability of CAS allows us to perform verification checks on translated equations, enabling us to evaluate large datasets. In particular, we developed two novel approaches: symbolic and numeric evaluation. The symbolic evaluation tries to simplify the difference between the left- and right-hand sides of an equation to zero. The numeric evaluation performs actual numeric calculations on test values and numerically checks the equivalence of an equation's left- and right-hand sides. On the DLMF, LACAST was able to symbolically verify 26.3% and 26.2% of translations to Maple and Mathematica, respectively. Symbolically unverified expressions were further evaluated numerically. LACAST achieved a numeric verification rate of 26.7% for Maple and 22.6% for Mathematica. In combination, both evaluation techniques verified 43.3% of translations for Maple and 42.9% for Mathematica. Performing the same techniques on the Wikipedia articles resulted in an overall verification rate of 18.1% and 23.6% for Maple and Mathematica, respectively.
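The core idea of the numeric evaluation can be sketched independently of any CAS: substitute test values into both sides of an equation and check that their difference stays below a tolerance. In this minimal sketch both sides are plain Python callables; the CAS round-trip, the constraint handling, and the chosen tolerance are illustrative simplifications.

```python
# Sketch of the numeric evaluation idea: an equation is accepted if
# |lhs(z) - rhs(z)| <= tol for all admissible test values.
import cmath

def numerically_verified(lhs, rhs, test_values,
                         constraint=lambda z: True, tol=1e-10):
    """Check both sides agree on every test value passing the constraint."""
    tested = [z for z in test_values if constraint(z)]
    return bool(tested) and all(abs(lhs(z) - rhs(z)) <= tol for z in tested)

# Euler's formula e^{ix} = cos(x) + i sin(x) passes on all test values:
values = [0.5, 1.0, -2.0, 1 + 1j]
ok = numerically_verified(lambda z: cmath.exp(1j * z),
                          lambda z: cmath.cos(z) + 1j * cmath.sin(z),
                          values)
print(ok)  # True
```

Note that an empty set of admissible test values yields no verification at all, mirroring the filtering of invalid value combinations against the constraints of an equation.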

The novel verification approach has proven to be very successful and even identified issues in the source databases, i.e., Wikipedia articles and the DLMF, and bugs in the commercial target CAS Maple and Mathematica. With the automatic evaluations from LACAST, we identified bugs regarding integrals and the variable extraction function in Mathematica, discovered numerous minor issues in the DLMF, including a sign error and incorrect semantic annotations, and detected a malicious edit in the Wikipedia edit history in the domain of OPSF. The errors in Mathematica and the DLMF have been reported and mostly fixed<sup>9</sup>. An overview of the reports is available in Appendix D in the electronic supplementary material.

### **6.3 Future Work**

The research advances in MathIR and the development of LACAST in this thesis motivate several follow-up projects. Current plans include incorporating LACAST into the DLMF to provide translations, automatic evaluation results, and peculiarities compared to multiple CAS for each equation. Additionally, plans are made for including LACAST as a translation-as-a-service endpoint. The developed semantification process is also planned to find its way into MediaWiki to semantically enhance mathematical content in Wikipedia pages. LACAST had not been open source due to its dependency on the POM tagger [402] and the semantic LATEX macros [260] when the research for this thesis took place. Since February 2022, the source code is publicly available at https://github.com/gipplab/LaCASt.

In this section, we provide a brief overview of four specific projects for our future work. Section 6.3.1 discusses ideas to improve the shortcomings of LACAST and related open research questions that motivate follow-up projects. Section 6.3.2 discusses how we plan to improve existing LATEX to MathML converters with our semantification pipeline. Section 6.3.3 explains the Wikipedia extension for semantically enhanced mathematical expressions. This section was

<sup>9</sup> As of 2021-10-01.

published as a poster together with M. Schubotz [17]. In Section 6.3.4, we discuss potential multilingual support for LACAST. The multilingual research project will be part of a DAAD-funded post-doctoral scholarship.

### **6.3.1 Improved Translation Pipeline**

The performance of the presented context-sensitive translator LACAST leaves some room for improvement and even motivates entirely new research projects. The most pressing shortcoming of LACAST is the lack of generalizability beyond OPSF. The main reason for this shortcoming is the open research task of identifying equations as definitions. Recent advances in definition detection in natural languages [111, 134, 183, 370] may pave the way to a reliable classification of mathematical equations in the near future. An equation tagged as a definition enables correct translations of dependent formulae in the same document. This enables LACAST to translate general functions, such as *f*(*x*), which are not directly defined in the CAS. Further, a definition detection for equations may help to build a comprehensive definition library across entire scientific corpora with numerous use cases for the mathematical community.

Another issue that remains woefully neglected by our translation tool is the positioning of branch cuts for multi-valued functions. The main reason for this shortcoming is that there is no database or standard available to store and describe branch cuts uniformly across multiple systems and libraries. While branch cuts are openly discussed and presented, their description is often embedded in natural language text, which harms the machine readability and consequently the accessibility of the information. In order to consider branch cut positions for a more reliable translation, we need to develop a standard to describe positions uniformly in a machine-readable format. Subsequently, a manual analysis across multiple CAS and libraries, including the DLMF, is required to build a comprehensive database that stores this information. Translation tools may finally use this database to either provide additional information during a translation process or automatically perform alternative translations based on the stored positioning of branch cuts. The latter, while considerably more difficult, is beneficial for further improving the verification of equations in the DLMF.
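To make the proposal concrete, one machine-readable branch cut record could look like the following sketch. No such standard exists yet; every field name, and the encoding of the cut as a parametrized curve, is a hypothetical illustration (here for the principal square root, whose cut along the negative real axis is continuous from above in the DLMF, Maple, and Mathematica).

```python
# Hypothetical sketch of a machine-readable branch cut record.
# All field names and the curve encoding are illustrative assumptions,
# not an existing standard.
import json

sqrt_record = {
    "function": "sqrt",               # function identifier (e.g., a DLMF macro)
    "argument": "z",
    "cut": {                          # the cut as a curve in the z-plane:
        "curve": "z = -t",            # the negative real axis,
        "parameter_range": [0, "inf"],
        "continuous_from": "above"    # values on the cut taken from above
    },
    "systems": {                      # per-system agreement with this choice
        "DLMF": "same",
        "Maple": "same",
        "Mathematica": "same"
    }
}

print(json.dumps(sqrt_record, indent=2))
```

A translation tool could consult the `systems` field of such records to warn about, or compensate for, diverging branch cut conventions between the source library and the target CAS.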

Lastly, the powerful numeric evaluation approach used to verify a translated expression heavily relies on the chosen numeric test values. LACAST currently uses the same ten numeric test values for all tested equations and filters invalid combinations regarding the constraints. While easy to maintain for many test cases, this approach ignores function-specific attributes such as domains, branch cuts, singularities, and other essential characteristics. Testing functions on specific *values of interest* enables several valuable applications. For example, numeric calculations specifically along the defined branch cuts of the involved functions could help to automatically detect definition disparities on branch cuts between the systems, e.g., by evaluating arccot(−1). In addition, testing values of interest potentially increases the trustworthiness of a numerically verified equation significantly. However, to the best of our knowledge, no study about values of interest for functions has been undertaken. It might even be questioned whether such values exist for all functions in the DLMF. Further, the values of interest may change depending on the actual arguments of the functions. In this case, LACAST would need to automatically adjust the tested values accordingly, which increases the complexity of the task even further.

Figure 6.4: Proposed pipeline to improve existing LATEX to MathML converters.

### **6.3.2 Improve LaTeX to MathML Converters**

As we have described in Section 3.3 in Chapter 3, our outlined translation pipeline can also be used to improve existing LATEX to MathML translators. Figure 6.4 highlights this additional remaining pipeline. In this thesis, we primarily focused on the main pipeline along **1**, **2**, **3**, and **7**. However, the information we gathered in steps **1** and **2** can also be forwarded to a MathML converter. In Chapter 2, we developed MathMLben, the MathML benchmark, with the help of LATExml, a LATEX to XML converter. We manually added semantic annotations to the source expressions in order to improve the conversion by LATExml. For example, the first entry contains an expression about Van der Waerden numbers *W*(2*, k*). Here, we manually added the link to the corresponding Wikidata ID Q7913892 for *W*, which (together with additional scripts) enabled LATExml to generate a proper, annotated content MathML representation of the expression.

We can now use our semantification steps to automate the annotation process. In combination with existing Wikidata entity linking approaches [320, 321, 327], we can also annotate the original expressions with Wikidata IDs as we did manually for MathMLben. While this semantic enrichment process through Wikidata IDs was developed specifically for LATExml, other LATEX to MathML converters can also profit from such annotations. SnuggleTeX, for example, is a LATEX to XML converter that allows users to pre-define the semantics of symbols in order to improve the so-called *upconversion*<sup>10</sup> process. One option in particular is the assumeSymbol command.

<sup>10</sup>SnuggleTeX uses this term for referring to a conversion process that requires semantic enrichment steps, e.g., from LATEX to content MathML or Maxima syntax.

Besides annotating single symbols, e.g., via

`\assumeSymbol{e}{exponentialNumber}`,

we can also define generic functions, such as

`\assumeSymbol{f_{n_k}}{function}` for $f_{n_k}(x)$. (6.5)

These pre-defined assumptions enable SnuggleTeX to perform a correct conversion to content MathML or the CAS Maxima.

### **6.3.3 Enhanced Formulae in Wikipedia**

Recently<sup>11</sup>, we deployed a feature that enables enhancing mathematical formulae in Wikipedia with semantics from Wikidata [308]. For instance, the wikitext code

**Annotated Wikitext Formula**

1 <math qid="Q35875">E=mc^2</math>

now connects the formula *E* = *mc*<sup>2</sup> to the corresponding Wikidata item by creating a hyperlink from the formula to the special page shown in Figure 6.5<sup>12</sup>. The special page displays the formula together with its name, description, and type, which the page fetches from Wikidata. This information is available for most formulae in all languages. Moreover, the page displays elements of the formula modeled as *has part* annotations of the Wikidata item.

The *has part* annotation is not limited to individual identifiers but is also applicable to complex terms, such as $\frac{1}{2}m_0v^2$, i.e., the kinetic energy approximation for slow velocities<sup>13</sup>. For example, we demonstrated using the annotation for the Grothendieck–Riemann–Roch theorem<sup>14</sup>

$$\text{ch}(f\_!\mathcal{F}^\bullet)\text{td}(Y) = f\_\*(\text{ch}(\mathcal{F}^\bullet)\text{td}(X)).\tag{6.6}$$

The smooth quasi-projective schemes *X* and *Y* in the theorem lack Wikipedia articles. However, dedicated articles on *quasi-projective variety* and *smooth scheme* exist. We proposed modeling this situation by creating the new Wikidata item *smooth quasi-projective scheme*<sup>15</sup>, which links to the existing articles as subclasses. To create a clickable link from the Wikidata item to Wikipedia, we could create a new Wikipedia article on *smooth quasi-projective scheme*. Alternatively, we could add a new section on *smooth quasi-projective scheme* to the article on *quasi-projective variety* and create a redirect from the Wikidata item to the new section.

Aside from implementing the new feature, defining a decision-making process for the integration of math rendering features into Wikipedia was equally important. For this purpose, we

<sup>11</sup>A. Greiner-Petter: *Link Wikipedia Articles from Specialpage Math Formula Information*, GitHub commit to mediawiki-extensions-Math on 27th November 2020: https://github.com/wikimedia/mediawiki-extensions-Math/commit/912866b976fbdcd94fda3062244d23a34c5e7a76

<sup>12</sup>https://en.wikipedia.org/wiki/Special:MathWikibase?qid=Q35875 [accessed 2021-08-18]

<sup>13</sup>https://en.wikipedia.org/w/index.php?oldid=939835125#Mass–velocity_relationship [accessed 2021-08-18]

<sup>14</sup>https://en.wikipedia.org/w/index.php?title=Special:MathWikibase&qid=Q1899432 [accessed 2021-08-18]

<sup>15</sup>https://www.wikidata.org/wiki/Q85397895 [accessed 2021-08-18]

founded the Wikimedia Community Group Math<sup>16</sup> as an international steering committee with the authority to decide on future features of the math rendering component of Wikipedia.

Figure 6.5: Semantic enhancement of the formula *E* = *mc*<sup>2</sup>. The special page shows the formula together with its name (mass-energy equivalence), its type (physical law), its description (mass and energy are proportionate measures of the same underlying property of an object), and the elements of the formula.

The new feature helps Wikipedia users to better understand the meaning of mathematical formulae by providing details on the elements of formulae. Because the new feature is available in all language editions of Wikipedia, all users benefit from the improvement. Rolling out the feature for all languages was important to us because using Wikipedia for more in-depth investigations is significantly more prevalent in languages other than English [226]. Nevertheless, even in the English Wikipedia, fewer than one percent of the articles have a quality rating of good or higher [299]. Providing better tool support to editors can help in raising the quality of articles. In that regard, our semantic enhancements of mathematical formulae will flank other semi-automated methods, such as recommending sections [299] and related articles [337].

To stimulate the widespread adoption of semantic annotations for mathematical formulae, we are currently working on tools that support editors in

creating the annotations and, therefore, successively determining the ground truth of mathematics in Wikipedia. With AnnoMathTeX [319], we are developing a tool that facilitates annotating mathematical formulae by providing a graphical user interface that includes machine-learning-assisted suggestions [14] for annotations. Moreover, we will integrate a field into the visual wikitext editor that will suggest Wikipedia authors link the Wikidata ID of a formula if the formula is in the Wikidata database. Improved tool support will particularly enable smaller language editions of Wikipedia to benefit from the new feature because the annotations performed in any language will be available in all languages automatically.

Additionally, our recent advances with LACAST on the Wikipedia dataset allow us to automatically verify equations in Wikipedia to some degree. We are currently working on a system that automatically triggers the verification engine on edits of mathematical content. This would allow us to generate a live feed of verified and unverified mathematical edits in the entire Wikipedia. While this presumably generates a lot of interesting data for numerous projects, it will also serve as a proof of concept to integrate the system into existing quality control mechanisms. In the long run, we hope to integrate the verification technique into the existing *Objective Revision Evaluation Service* (ORES) [144], similar to other recently emerged ORES extensions [359, 401].

<sup>16</sup>https://meta.wikimedia.org/wiki/Wikimedia\_Community\_User\_Group\_Math [accessed 2021-08-18]

### **6.3.4 Language Independence**

The multilingual aspect of our translator becomes more and more important with the focus on Wikipedia. Since Wikipedia is a multilingual encyclopedia, providing a language-independent semantification process is a desirable goal. In general, the concept of our developed semantification approach is language independent. The pipeline relies on a POS tagger to tag tokens and generate parse trees of the sentences. The score of an MOI-description pair is calculated based on the distance between both tokens in the parse tree. Consequently, we can presume that our semantification pipeline works for other languages too, as long as a reliable POS tagger is available for that language. However, we already noticed minor issues with the well-developed CoreNLP POS tagger for the English language when using the MLP approach. As a reminder, the MLP approach suggests masking mathematical elements by placeholders before using a POS tagger on the sentence. For example, in the following sentence

### **Example sentence including math**

```
1 The Jacobi polynomial P_n^(α,β)(x) is an orthogonal polynomial.
```
the mathematical expression is replaced by the placeholder MATH\_1.

### **Example sentence with masked math**

```
1 The Jacobi polynomial MATH_1 is an orthogonal polynomial.
```

While this approach works well in many cases, in this particular example CoreNLP's POS tagger<sup>17</sup> tags both *polynomial* tokens as adjectives (JJ) while both should be tagged as nouns (NN).
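The masking step described above can be sketched in a few lines. The `<math>...</math>` markup used here for formula detection is an illustrative stand-in for the real formula detection; the point is only the placeholder substitution and the mapping kept for restoring the formulae after tagging.

```python
# Sketch of the MLP masking step: mathematical expressions are replaced
# by placeholders MATH_1, MATH_2, ... before POS tagging, and a mapping
# is kept to restore them afterwards.
import re

def mask_math(sentence):
    mapping = {}

    def repl(match):
        key = f"MATH_{len(mapping) + 1}"
        mapping[key] = match.group(1)
        return key

    masked = re.sub(r"<math>(.*?)</math>", repl, sentence)
    return masked, mapping

masked, mapping = mask_math(
    "The Jacobi polynomial <math>P_n^{(\\alpha,\\beta)}(x)</math> "
    "is an orthogonal polynomial.")
print(masked)
# The Jacobi polynomial MATH_1 is an orthogonal polynomial.
```

The POS tagger then only ever sees the placeholder tokens, which is exactly why the noun-token presumption discussed below becomes critical.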

The underlying issue is that the MLP approach presumes math expressions to represent noun tokens. However, the *mathematical language* is generally more complex than this simple scheme suggests [138]. This language can become quite different from general natural language communication. The mathematical language introduces a technical terminology with entirely new terms, such as '*functor*', changes the meaning of existing vocabulary, such as '*group*' or '*ring*', and even defines entire phrases to represent math concepts, such as '*without loss of generality*' or '*almost surely*'. All these specifics need to be supported by a POS tagger. Math notation is often part of a natural language sentence but does not necessarily represent a logical token. In addition, we presume that mathematical expressions are generally language-independent. However, their notation style may change from language to language, even for simple cases. For example, while the US or Germany uses ≥ to express a greater-or-equal relation, the notation ≧ is more common in Japan. Considering the sheer amount of different math notations, it might not be obvious to a student from Japan that ≥ and ≧ refer to the same relation. Yet, these symbols are so basic that most authors, even in educational literature, would probably not explicitly declare their meaning in the context. This issue grows with a more and more educated audience. For example, math educational books written for math students at universities rarely mention the specific meanings of logic symbols (e.g., ∧, ∨), quantifiers (e.g., ∀, ∃), or set notations (e.g., ∩ and ∪).

<sup>17</sup>Tested with CoreNLP's version 4.2.2.

Unfortunately, the multilingual aspects of mathematics have barely been studied in the past. D. Halbach [143] recently tried to take advantage of the multilingual versions of Wikipedia articles to identify the defining formulae of these articles. A defining formula of an article is the mathematical expression that is the main subject of that article. For example, $P_n^{(\alpha,\beta)}(x)$ can be considered the defining formula of the article about Jacobi polynomials. D. Halbach assumed that a mathematical expression that appears in multiple language versions of the same article is a good candidate for such a defining formula. Unfortunately, it turned out that different languages tend to use different visualizations of the same formula. For example, he showed that the Polish, English, German, and French Wikipedia articles on Schwarz's theorem use different mathematical formulae for the same concept. This result indicates that the semantification approach we developed in this thesis may not be easily generalized to other languages. In addition, there is no POS tagger available that is specialized in mathematical content.

In collaboration with researchers from the National Institute of Standards and Technology (NIST) in the US, the National Institute of Informatics (NII) in Japan, and the University of Wuppertal in Germany, we plan to study the multilingual aspects of mathematical languages to analyze language-specific notation and declaration differences. This project is part of a post-doctoral DAAD scholarship and includes training a math-specific NLP model for better POS tagging of mathematical content.

This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

### **BACK MATTER**

### **Glossary**

### **Symbols**

$$X = \{f \mid f \in \mathcal{D} \cup \mathcal{K} \wedge (f \in \mathcal{K} \Rightarrow f \notin \mathcal{D})\}$$

Our definition of a mathematical context *X* defined in (4.2) on page 104. A context is a set of facts *f* from a document D and a set of common knowledge facts K such that document facts overwrite common knowledge facts. 106, 107, 138

### L<sub>C</sub> **— Mathematical Content Languages**

Denotes mathematical content languages (CL), such as semantic LATEX, content MathML, or CAS syntaxes. 106, 107, 112, 134, 137, 138

### L<sub>M</sub> **— Computer Algebra System Languages**

Refers to CAS languages in general, such as the syntax of Mathematica, Maple, or SymPy inputs. 107, 109, 110

### L<sub>P</sub> **— Mathematical Presentation Languages**

Denotes mathematical presentation languages (PL), such as presentation MathML or LATEX. 106, 107, 109–112, 134, 138

$$\text{mBM25}(t,D) = \max_{d \in D} \frac{(k+1)\,\text{IDF}(t)\,\text{ITF}(t,d)\,\text{TF}(t,d)}{\max_{t' \in d|_{c(t)}} \text{TF}(t',d) + k\left(1 - b + \frac{b\,\text{AVGDL}}{|d|\,\text{AVG}_C}\right)}$$

Our mathematical BM25 ranking to measure the importance of a given MOI *t* in a corpus *D* of documents *d*. IDF(*t*) is the inverse document frequency, ITF(*t, d*) the inverse term frequency of *t* in *d*, TF(*t, d*) the term frequency of *t* in *d*, AVGDL the average document length (number of terms) in *D*, AVG<sub>C</sub> the average complexity of terms in *D*, *c*(*t*) the complexity of *t*, and *b*, *k* are parameters. 85

sDLMF(*r<sub>f</sub>*)**:**

The probability score for a replacement rule *r<sub>f</sub>* = *m* → *m*′. This score is the probability that *m*′ is rendered as *m* in the DLMF. For example, the general hypergeometric function never omits arguments, such as in <sub>2</sub>*F*<sub>1</sub>(*z*), in the DLMF. Hence, the probability of <sub>2</sub>*F*<sub>1</sub>(*z*) is 0. In contrast, in 19.7% of cases, the function uses the linearly rendered form <sub>2</sub>*F*<sub>1</sub>(*a, b*; *c*; *z*). 114, 137

sES(*f*) = sES(MOI*,* MC)

The normalized Elasticsearch score for a retrieved semantic macro *m*′ for the given MC ∈ *f*. This score is higher if MC better matches the description of the semantic macro *m*′. Since ES provides absolute scores, this score is normalized to the best fitting hit, i.e., the first retrieved result is always scored 1. 113, 114

sMLP(*f*) = sMLP(MOI*,* MC)

The score of the MLP engine [330] for a given fact *f*, which depends on (1) the distance between the MOI and its first occurrence in the document D, (2) the distance in the natural language syntax tree between the MOI and the MC, and (3) whether the MOI and MC match pre-defined patterns. 112–114, 137

### t(*e, X*) = t<sub>m</sub>(t<sub>s</sub>(*e, X*))

Our translator function follows a two-step strategy in which the first step is a semantification t<sub>s</sub>(*e, X*) followed by a rule-based transformation t<sub>m</sub>(*e*). 106, 107, 134, 138

### t<sub>m</sub>(*e*) = g<sub>r1</sub> ◦ ··· ◦ g<sub>rn</sub>(*e*)

A rule-based translation function that performs translations based on a set of rules *r<sub>k</sub>* ∈ R<sub>*C*<sub>1</sub>*C*<sub>2</sub></sub>, *k* = 1*,...,n*, from a content language *C*<sub>1</sub> to another content language *C*<sub>2</sub>. Similar to the semantification function, it performs graph transformations *g<sub>r</sub>* based on the rules. Example implementations are LACAST or SymPy's latex2sympy function. 106, 107

### t<sub>s</sub>(*e, X*) = g<sub>f1</sub> ◦ ··· ◦ g<sub>fn</sub>(*e*)

A fact-based semantification translation function that takes an expression *e* and a context *X* to perform a series of graph transformations *g<sub>f</sub>* defined by the facts *f* to semantically enhance subtrees of *e*. 106–108, 137

### **A**

### **AI — Artifcial Intelligence**

A broad research field with a focus on machine (artificial) intelligence. 103, 142

### **AJIM — Aslib Journal of Information Management**

An international journal with a 5-year IF of 2.653 in library and information science with a focus on information and data management. According to https://academicaccelerator.com/5-Year-Impact-Factor/Aslib-Journal-of-Information-Management [accessed 2021-10-01], it is placed 33 of 227 journals in the field of library and information sciences. 9, 15, 163

### **arXiv:**

Is a pre-print archive for scientific papers in a variety of different fields, such as mathematics, physics, or computer science. See arxiv.org [accessed 2021-10-01] for more information. 40, 62–66, 68, 70, 71, 73–75, 78–84, 86, 91, 92, 99, 101, 103, 144, 192

### **arXMLiv:**

An HTML5 (including MathML) dataset based on the arXiv articles. The HTML5 was generated via LATExml and is available at https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/ [accessed 2021-10-01] [132]. 65, 74

### **Axiom:**

Is a free, general-purpose CAS first developed by IBM around 1965 (named *Scratchpad* at that time). Since 2001, Axiom has been open source under a modified BSD license and is available on GitHub at https://github.com/daly/axiom [accessed 2021-10-01] [173]. 5, 34, 35

### **B**

### **BLEU — Bilingual Evaluation Understudy**

Is an algorithm to measure the quality of translated texts, first described by Papineni et al. [282] in 2001. The algorithm presumes that the closer (i.e., the more *n*-grams it shares) a translation is to human translations, the better it is. 14, 99, 100, 134, 146

### **BM25 — Okapi BM25**

Is a ranking function to calculate the relevance of results in a search engine [310]. The underlying idea of BM25 is that words that appear regularly in only a few documents are more *important* for those documents than words that appear everywhere across the entire corpus. 12, 73, 83, 85, 113, 145

### **C**

### **CAS — Computer Algebra System(s)**

A mathematical software that allows one to work with mathematical expressions, e.g., by manipulating, computing, or plotting them. The acronym CAS, in this thesis, refers to a single system or multiple systems depending on the context. ix, xi, xii, 1–8, 10, 13–15, 19–22, 24–36, 38, 40–43, 47, 52, 55, 58–60, 93, 95, 97, 103–108, 111, 115–120, 123–129, 131–136, 138, 139, 141, 143–150, 152, 154–156, 158, 163–165, 168, 171, 174, 175, 180, 193

### **CD — Content Dictionary**

Content dictionaries are structured documents that contain the definitions of mathematical concepts. See the OpenMath specification for more details [53]. 23–26, 31, 57, 58, 143

### **CICM — Conference on Intelligent Computer Mathematics**

An annual international conference on mathematical computation and information systems (has a CORE rank of C since 2021). 9, 10, 15, 116

#### **CL — Content Language**

Content languages are languages that encode mainly semantic (content) information, such as content MathML, OpenMath, or CAS syntaxes. 43

#### **CLEF — Conference and Labs of the Evaluation Forum**

An annual international conference for systematic evaluation of information access systems. 9

### **cMML — Content MathML**

Content MathML encodes the meaning of mathematical notations. For more information see the explanations about MathML. 22, 23

#### **CORE — Computing Research and Education Association of Australasia**

Is an association of university departments that provides assessments of major conferences in the computing disciplines. The main categories are A\* (flagship), A (excellent), B (good to very good), and C (other ranked conferences that meet minimum standards); see http://portal.core.edu.au/conf-ranks/ [accessed 2021-10-01]. 8, 9

### **CoreNLP:**

CoreNLP is a Java library for natural language processing tasks developed by the Stanford NLP Group; it includes tokenizers, POS taggers, lemmatizers, and more [240]. 109, 110, 160, 185, 186

### **D**

### **DBOW-PV — Distributed Bag-of-Words of Paragraph Vectors**

An approach to embed entire paragraphs into single vectors introduced by Le and Mikolov [222]. 67–69

### **DL — Deep Learning**

Is a broad family of machine learning methods that uses neural networks for learning features. 61

### **DLMF — Digital Library of Mathematical Functions**

A digital version [98] of *NIST's Handbook of Mathematical Functions* [276]. The DLMF (or the book, respectively) is a standard reference for OPSF and provides access to numerous definitions, identities, plots, and more. ix, x, xii, 1, 4, 5, 8, 12, 14, 15, 17, 25, 28, 30–33, 35, 40, 46, 47, 49–51, 56, 58, 62, 63, 65, 66, 93–95, 97, 98, 100, 101, 103–109, 112–119, 121–126, 129–137, 139–142, 144–156, 163–165, 168, 174–183, 190–192

### **DML — Digital Mathematical Library**

A general digital library that specifically focuses on mathematics. 63, 115–118, 123, 128, 132, 133, 148, 164

### **DRMF — Digital Library of Mathematical Formulae**

An outgrowth of the DLMF project [77, 78]. 30, 32

### **E**

#### **EMNLP — Empirical Methods in Natural Language Processing**

An annual international conference on natural language processing (has a CORE rank of A). 9

#### **ES — Elasticsearch**

A search engine written in Java that uses the open-source search engine library Apache Lucene; see https://www.elastic.co/ and https://lucene.apache.org/ [accessed 2021-07-02]. 86, 88, 113, 193

### **G**

### **GUI — Graphical User Interface**

A visual interface that allows for interacting with data or software. 48, 49

### **H**

### **HTML — HyperText Markup Language**

The standard markup language for web documents. 23, 74

### **I**

### **ICMS — International Congress on Mathematical Software**

A biennial congress that gathers mathematicians, scientists, and programmers interested in the development of mathematical software. 9, 13, 60

### **J**

#### **JCDL — Joint Conference on Digital Libraries**

An annual major conference in the field of digital libraries (had a CORE rank of A\* until it was unranked in 2021 because the CORE committee removed the entire digital library domain from their ranking scheme). 9, 10, 14, 19, 163, 166

### **L**

### **LACAST — LaTeX to CAS translator**

Is the name of the framework we developed in this thesis to translate mathematical LaTeX to CAS syntax. The first version of LACAST was part of the author's Master's thesis and supported translations only from semantic LaTeX to Maple [3, 13]. Within this thesis, we extended LACAST to support general LaTeX [11] expressions and additional CAS [8], such as Mathematica. The source of LACAST has been publicly available at https://github.com/gipplab/LaCASt since February 2022. ix–xii, 7, 8, 10, 14–17, 28–30, 32, 58, 95, 100, 101, 105–107, 109–111, 114–119, 121, 122, 124–134, 139, 141, 144–152, 154–156, 159, 163, 168, 171, 174, 180, 191

### **LaTeX:**

Is an extension of the typesetting system TeX used for document preparation. LaTeX provides additional macros on top of TeX, allowing the writer to focus more on the content of a document rather than on the exact layout. Since this thesis focuses on mathematical expressions in LaTeX, there is not much difference between TeX and LaTeX. ix, xi, 1–3, 5–8, 10, 13, 19–22, 24, 25, 27–35, 37–42, 45–54, 56–60, 74, 83, 88, 93, 94, 97–100, 102–108, 110, 112, 113, 116, 118, 121, 129, 132, 135, 138–141, 143–146, 152–154, 156, 157, 166, 174–180, 188–193

### **LaTeXML:**

Is a tool developed by B. Miller to convert LaTeX documents to a variety of other formats, such as XML or HTML. The tool can also be used to transform single mathematical LaTeX expressions to math-specific formats, such as MathML, or to image formats, such as SVG. More information can be found at *LaTeXML: A LaTeX to XML/HTML/MathML Converter*, https://dlmf.nist.gov/LaTeXML/ [accessed 2021-10-01]. 11, 32, 33, 38, 46–51, 53, 58, 74, 75, 77, 78, 83, 94, 98, 102, 143, 146, 152, 154, 157

### **M**

### **Maple:**

One of the major general-purpose CAS [36], developed by *Maplesoft*. If not stated otherwise, we refer to version 2020.2. ix, xii, 1, 2, 4–8, 10, 15, 20, 21, 26, 28, 31, 32, 34, 35, 38, 43, 52, 58, 103, 104, 107–109, 115–120, 123–125, 127–136, 141, 143–145, 147–149, 152, 154, 155, 164, 165, 168, 169, 180, 189, 193

### **Mathematica:**

One of the major general-purpose CAS [393], developed by *Wolfram Research*. If not stated otherwise, we refer to version 12.1.1. ix, xii, 1–6, 8, 10, 15, 20, 21, 26, 28–31, 35, 41, 42, 52, 97–105, 107–109, 114, 115, 117, 119, 124, 125, 127–136, 138–141, 143, 145–152, 154, 155, 164, 169–174, 180, 181, 189, 193

### **MathIR — Mathematical Information Retrieval**

Is a subfield of the Information Retrieval (IR) research area and as such focuses on obtaining information (mostly semantics) from or retrieving relevant mathematical expressions. Note that MIR is another common acronym for mathematical information retrieval. In this thesis, we stick with the less overloaded and more precise abbreviation MathIR. ix, xi, 1, 6, 8, 11, 19, 39, 40, 54, 55, 59–63, 65, 71–73, 83, 105, 144, 148, 153, 155

### **MathML — Mathematical Markup Language**

An XML-structured standard for representing mathematical notations in web pages and other digital documents [169]. MathML allows encoding the meaning of mathematical notations to some degree, which is often referred to as *content MathML*. In contrast, *presentation MathML* refers only to the visual encoding of math formulae. When a math formula is encoded in presentation and content MathML at the same time, this is often called *parallel markup MathML*. 2, 4, 6–8, 10–12, 19–28, 32–35, 37, 39, 41, 43–47, 49–53, 57, 58, 62, 63, 65, 74–78, 92, 94, 105, 106, 117, 133, 143, 144, 148, 149, 152, 156–158, 166

### **MathMLben — MathML Benchmark**

We developed MathMLben as a benchmark dataset for measuring the quality of MathML markup of mathematical formulae appearing in a textual context. See Section 2.3.2 on page 43 for further details. 10, 11, 45, 46, 51, 67, 94, 143, 148, 152, 157

### **MATLAB:**

Is one of the major proprietary CAS, with a specific focus on numeric computations, developed by MathWorks. MATLAB is also the name of the underlying programming language the CAS MATLAB uses [164, 246]. 1, 5, 10, 35

### **Maxima:**

Is an open-source general-purpose CAS first released in 1982 (originally developed as a branch of the predecessor CAS Macsyma [264]) and still actively maintained [324]. 2–4, 28, 29, 35, 157, 158

### **mBM25 — Mathematical Okapi BM25**

Our extension of the BM25 score for mathematical expressions. 85, 88–90

### **MC — Mathematical Concept**

Is a term referring to the concept that defines a mathematical expression, including its visual appearance, underlying definition, constraints, domains, and other semantic information [9]. In the context of this thesis, we simplify this concept and presume that a name (or noun phrase) sufficiently specifies a concept, so that the name (or noun phrase) is considered a representative MC. 106, 108–113, 137, 185

### **MEOM — Mathematically Essential Operator Metadata**

Describes the metadata, i.e., argument(s) and bound variable(s), in sums, products, integrals, and limit operators. 120–122, 124, 128, 129

### **MFS — Mathematical Functions Site**

A dataset of mathematical functions and relations maintained by Wolfram Research. The dataset is available at https://functions.wolfram.com/ [accessed 2021-10-01]. 98–102

### **MKM — Mathematical Knowledge Management**

Is the general study of harvesting, maintaining, or managing mathematical information in literature and databases. 61, 62, 65

### **ML — Machine Learning**

Is a computer science research field (often described as a subfield of artificial intelligence) with the relatively broad goal of making predictions for unseen data based on training data. 40, 61, 63, 69–71, 97, 103

### **MLP — Mathematical Language Processing**

Mathematical language processing describes the technical process of analyzing mathematical texts. A specific MLP task is the mapping of textual descriptions to components of mathematical formulae, such as mathematical identifiers (see Schubotz et al. [279]). 61, 62, 65, 72, 110, 137, 160, 185, 186, 188

### **MOI — Mathematical Objects of Interest**

Is a term referring to subexpressions in mathematical formulae with a specific meaning [9]. One can consider these parts as elements of general interest. 12, 13, 60, 73, 76, 86, 91–94, 106, 108–113, 136–138, 140, 144–146, 152–154, 160, 185–188, 191, 192

### **N**

### **NIST — National Institute of Standards and Technology**

A US government research institution. 30, 86, 161

### **NLP — Natural Language Processing**

Is a research field focusing on analyzing and processing natural language in texts, images, videos, or audio formats. In this thesis, we mainly refer to natural language processing of texts rather than other multimedia formats. 39, 61, 64, 65, 72, 148, 161

### **NN — Neural Network**

A graph network that aims to mathematically mimic biological neural networks. 61

### **O**

### **OCR — Optical Character Recognition**

Is a research field that focuses on identifying text and other symbols in images or videos. 28, 39, 99

#### **OMDoc — Open Mathematical Document**

Is a markup format developed by Michael Kohlhase [198] to describe mathematical documents. 22, 23, 26, 27, 32, 33, 36

### **OpenMath:**

Is a markup language similar to MathML which uses an XML format to encode semantic information of mathematical expressions. The standard is maintained by the OpenMath Society. See http://openmath.org/ [accessed 2021-10-01] for more information. 6, 7, 19, 21–27, 34–37, 41, 58, 62, 106, 117, 133

### **OPSF — Orthogonal Polynomials and Special Functions**

The set of orthogonal polynomials and special functions. Special functions are functions that, due to their general importance in certain fields, have specific names and standard notations. Note that there is no formal definition of the term *special function*. The *NIST Handbook of Mathematical Functions* [276] is a standard resource that covers a comprehensive set of functions (and orthogonal polynomials) widely accepted as *special*. 1, 3, 31–33, 35, 93, 101, 105, 111, 112, 114, 133, 140, 141, 145, 154–156, 185

#### **ORES — Objective Revision Evaluation Service**

A system used by Wikipedia to classify edits as potentially damaging changes or changes made in good faith [144]. 103–105, 135, 136, 141, 142, 159

### **P**

### **PL — Presentation Language**

Presentation languages are languages that encode mainly visual information, such as LaTeX or presentation MathML. 43, 51

### **pMML — Presentation MathML**

Presentational MathML refers only to the visual encoding of math formulae. For more information see the explanations about MathML. 22, 23, 75–77

#### **POM — Part-of-Math**

Is a LaTeX parser developed by Abdou Youssef [402] that tags each token in the parse tree with additional information, similar to Part-of-Speech (POS) taggers in natural languages. 28, 32, 38, 52, 56, 93, 94, 110, 111, 151, 153, 155

### **POS — Part-of-Speech**

Part-of-Speech tagging describes the process of tagging words in text with grammatical properties of the word. 45, 109, 160, 161, 185

### **R**

### **Reduce:**

Probably the first CAS, developed in 1963 by Anthony C. Hearn [151], with a large impact on every CAS that followed. Since 2008, Reduce has been open source under the BSD license. 5, 34, 35, 164

### **S**

### **Scientometrics:**

An international journal with a 5-year IF of 3.702, covering quantitative aspects of the science of science, communication in science, and science policy. According to https://academic-accelerator.com/5-Year-Impact-Factor/Scientometrics [accessed 2021-10-01], it is ranked 18th of 227 journals in the field of library and information sciences. 9, 19, 60

#### **SCSCP — Symbolic Computation Software Composability Protocol**

Is a protocol for communicating mathematical formulae between mathematical software systems, specifically CAS. It was developed as part of the *SCIEnce project*, funded with €3 million by the European Union. More information can be found in the two publications about the project [119, 361]. 24, 26, 35, 36, 58

### **semantic LaTeX:**

Refers to mathematical expressions that use semantic macros developed by B. Miller for the DLMF. Each of these LaTeX macros is tied to a specific definition in the DLMF. Hence, a semantic LaTeX macro represents a unique, unambiguous mathematical function as defined in the DLMF. An alternative name for semantic LaTeX is content LaTeX. ix, xi, 2, 7, 8, 10, 12, 15, 19, 22, 28, 30–33, 35, 38, 58, 93–95, 97–100, 115, 116, 133, 138, 143–146, 149, 152–155, 174

#### **Semantification:**

Refers to a process that semantically enhances mathematical expressions. Other authors may also refer to this as *semantic enrichment* [71, 270, 402]. ix, xi, 7–11, 13, 24, 54, 57–59, 94, 95, 97, 103, 104, 106, 107, 115, 138, 144, 145, 147, 152–157, 160, 161, 193

### **SIGIR — Special Interest Group on Information Retrieval**

A premier annual international conference on research and development in information retrieval (has a CORE rank of A\*). 9, 11, 19

### **SnuggleTeX:**

Is an open-source Java program for converting LaTeX to XML, mainly MathML. SnuggleTeX is one of the rare converters that offer a semantic enrichment process to content MathML and the only LaTeX to CAS converter (supporting Maxima) that is not part of a CAS itself [251]. SnuggleTeX is no longer developed; the most recent version, 1.2.2, dates from 2010. See also https://www2.ph.ed.ac.uk/snuggletex [accessed 2021-10-01]. 2–4, 28, 29, 157, 158

#### **STEM — Science, Technology, Engineering, and Mathematics**

A group of academic disciplines. ix, xi, 2, 20, 27

### **STEX — Semantic TeX**

Semantic extension of TeX developed by Michael Kohlhase [200]. 19, 22, 30, 32, 33

#### **SVG — Scalable Vector Graphics**

An XML vector image format. 38, 49, 51, 52

#### **SymPy:**

An open-source CAS [252] written in Python. 2, 4, 5, 10, 15, 28–30, 34, 35, 146, 149, 154, 164, 174

### **T**

### **t-SNE — t-distributed Stochastic Neighbor Embedding**

Is a statistical method to visualize high-dimensional data in more convenient, easy-to-analyze one-, two-, or three-dimensional plots. t-SNE uses a nonlinear dimensionality reduction method that tries to preserve structural groups in the data. The method was first introduced by Hinton and Roweis [154]. 69, 70

#### **TACAS — Tools and Alg. for the Construction and Analysis of Systems**

TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems (has a CORE rank of A). 9, 15, 116, 163, 168, 180

#### **TF-IDF — Term Frequency-Inverse Document Frequency**

Is a statistical measure intended to reflect the importance of tokens (e.g., words) to a document in a larger corpus. The underlying assumption behind the measure is that tokens frequent across an entire corpus are less important than tokens that appear frequently in single documents but rarely elsewhere. The BM25 ranking function is based on the principle of TF-IDF scores. 79, 83, 85, 89, 90
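The measure can be sketched in a few lines of Python. This is a simplified illustration of one common TF-IDF variant (relative term frequency and log-scaled inverse document frequency); the function names are our own, and other smoothing schemes exist.

```python
import math

def tf_idf(term, doc, corpus):
    """Toy TF-IDF: relative term frequency in doc times the log-scaled
    inverse document frequency over a corpus of tokenized documents."""
    tf = doc.count(term) / len(doc)                 # relative term frequency
    df = sum(1 for d in corpus if term in d)        # document frequency
    idf = math.log(len(corpus) / (1 + df))          # +1 avoids division by zero
    return tf * idf
```

A term that appears in only one document scores higher there than a term of the same frequency that appears in every document of the corpus.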

### **TPAMI — Transactions on Pattern Analysis and Machine Intelligence**

An IEEE-published top monthly journal with a 5-year IF of 25.816 and a focus on pattern analysis and recognition and related fields. According to https://academic-accelerator.com/5-Year-Impact-Factor/jp/IEEE-Transactions-on-Pattern-Analysis-and-Machine-Intelligence [accessed 2021-10-01], it is the top journal in three categories and second in two additional categories. 9, 13, 16, 97, 116, 163

### **V**

### **VMEXT — Visual Tool for Mathematical Expression Trees**

A visualization tool for mathematical expression trees developed by Schubotz et al. [331]. 37, 46, 49, 50

### **W**

### **W3C — World Wide Web Consortium**

Is an international organization that develops standards for the World Wide Web. See www.w3.org [accessed 2021-06-09]. 23, 24

#### **WED — Wolfram Engine for Developers**

Is a free interface for the Wolfram Engine (the engine behind Mathematica). Since 2019, this interface allows developers to interact with and use most of Mathematica's core features without purchasing a full license. More information is available at https://www.wolfram.com/engine/ [accessed 2021-09-07]. 117, 127, 131

#### **WSDM — Web Search and Data Mining**

A premier conference on web-inspired research involving search and data mining (has a CORE rank of A\*). 9, 97

#### **WWW — The Web Conference**

An annual major conference with the focus on the world wide web (has a CORE rank of A\*). 9, 12, 60

### **X**

### **XML — Extensible Markup Language**

A markup language mainly used for the representation of many different data structures. 20, 23–25, 27, 32, 33, 37, 43, 47, 51, 52, 74, 76, 77, 157

#### **XSLT — Extensible Stylesheet Language Transformations**

A language to transform XML documents. 23, 24, 26

### **Z**

### **zbMATH — Zentralblatt MATH**

Is an international reviewing service for abstracts and articles in mathematics. zbMATH provides access to the abstracts and reviews of research articles, mostly in the field of pure and applied mathematics; see also https://zbmath.org/ [accessed 2021-10-01]. 13, 73–75, 78–80, 83, 84, 86, 88–90, 92, 144, 148

### **BACK MATTER**

### **Bibliography of Publications, Submissions & Talks**







